ABSTRACT


PASM, a large-scale multimicroprocessor system being designed at Purdue
University for image processing and pattern recognition, is described. This
system can be dynamically reconfigured to operate as one or more independent
SIMD and/or MIMD machines. PASM consists of a Parallel Computation Unit,
which contains N processors, N memories, and an interconnection network;
Q Micro Controllers, each of which controls N/Q processors; N/Q parallel
secondary storage devices; a distributed Memory Management System; and a
System Control Unit to coordinate the other system components. Possible
values for N and Q are 1024 and 16, respectively. The control schemes,
interprocessor communications, and memory management in PASM are explored.
Examples of how PASM can be used to perform image processing tasks are
given.


I. INTRODUCTION


As a result of the microprocessor revolution, it is now feasible to
build multimicroprocessor systems capable of performing image processing
tasks more rapidly than previously possible. There are many image
processing tasks which can be performed on a parallel processing system, but
are prohibitively expensive to perform on a conventional computer system due
to the large amount of time required to do the tasks [34]. In addition, a
multimicroprocessor system can use parallelism to perform the real-time
image processing required for such applications as robot (machine) vision,
automatic guidance of air and space craft, and air traffic control.

There are several types of parallel processing systems. An SIMD
(single instruction stream - multiple data stream) machine [12] typically
consists of a set of N processors, N memories, an interconnection network,
and a control unit (e.g. Illiac IV [7]). The control unit broadcasts
instructions to the processors and all active ("turned on") processors
execute the same instruction at the same time. Each processor executes
instructions using data taken from a memory with which only it is
associated. The interconnection network allows interprocessor
communication. An MSIMD (multiple-SIMD) system is a parallel processing
system which can be structured as two or more independent SIMD machines
(e.g. MAP [25,26]). The Illiac IV was originally designed as an MSIMD
system [1]. An MIMD (multiple instruction stream - multiple data stream)
machine [12] typically consists of N processors and N memories, where each
processor can follow an independent instruction stream (e.g. C.mmp [66]).
As with SIMD architectures, there is a multiple data stream and an
interconnection network. A partitionable SIMD/MIMD system [46] is a
parallel processing system which can be structured as two or more
independent SIMD and/or MIMD machines (e.g. PASM [38]).


PASM, a partitionable SIMD/MIMD machine, is a large-scale dynamically
reconfigurable multimicroprocessor system being developed at Purdue
University [38,43,44,46-49]. It is a special purpose system being designed
to exploit the parallelism of image processing and pattern recognition
tasks. It can also be applied to related areas such as speech processing
and biomedical signal processing. In this paper, the architecture of PASM
is presented and examples of its use in performing image processing tasks
are given.

Due to the low cost of microprocessors, computer system designers have
been considering various multimicrocomputer architectures, such as
[6,8,18,21,22,29,60,63,64]. PASM was the first system in the literature to
combine the following features:

(1) it can be partitioned to operate as many independent SIMD and/or MIMD
machines of varying sizes, and

(2) a variety of problems in image processing and pattern recognition will
be used to guide the design choices.

It was not until the current "microprocessor revolution" that designing
a system with as many as 1024 full processors was feasible. Many designers
have discussed the possibilities of building large-scale parallel processing
systems, employing $2^{14}$ to $2^{16}$ microprocessors, in SIMD (e.g.
binary n-cube array [29]) and MIMD (e.g. CHoPP [60,61]) configurations.
Without the presence of such a large number of processors, the concept of
partitioning the system into smaller machines which can operate as SIMD or
MIMD machines was unnecessary. Nutt [25] has suggested a machine which is a
multiple-SIMD system. Lipovski and Tripathi [22] have considered the idea
of combining the SIMD and MIMD modes of operation in one system. In
addition, developments in recent years have shown the importance of
parallelism to image processing, using both cellular logic arrays (e.g.
CLIP [56], BASE 8 [32]) and SIMD systems (e.g. STARAN [33]). A variety of
such systems are discussed in [13]. Thus, the time seems right to
investigate how to construct a computer system such as PASM: a machine
which can be dynamically reconfigured as one or more SIMD and/or MIMD
machines, optimized for a variety of important image processing and pattern
recognition tasks.

The use of parallel processing in image processing has been limited in
the past due to cost constraints. Most systems used small numbers of
processors (e.g. Illiac IV [7]), processors of limited capabilities (e.g.
STARAN [33]), or specialized logic modules (e.g. PPM [19]). With the
development of microprocessors and related technologies it is reasonable to
consider parallel systems using a large number of complete processors.

SIMD machines can be used for "local" processing of segments of images
in parallel. For example, the image can be segmented, and each processor
assigned a segment. Then, following the same set of instructions, such
tasks as line thinning, threshold dependent operations, and gap filling can
be done in parallel for all segments of the image simultaneously. Also in
SIMD mode, matrix arithmetic used for such tasks as statistical pattern
recognition can be done efficiently.

MIMD machines can be used to perform different "global" pattern
recognition tasks in parallel, using multiple copies of the image or one or
more shared copies. For example, in cases where the goal is to locate two
or more distinct objects in an image, each object can be assigned a
processor or set of processors to search for it.

There are also tasks which require parallel processing in both SIMD and
MIMD modes. As a simple example consider the task of determining if a line
drawing contains a square. In SIMD mode a parallel processing system can
segment the image and each processor can locally determine which points in
its segment, if any, are possible corners of squares. The system can then
switch to MIMD mode, where each corner will be assigned to a processor which
examines the image globally to determine if the corner is actually part of a
square. Another SIMD/MIMD application might involve using the same set of
microprocessors for preprocessing an image in SIMD mode and then doing a
pattern recognition task in MIMD mode.

Figure 1 is a block diagram of the basic components of PASM. The
Parallel Computation Unit (PCU) contains N processors, N memory modules, and
an interconnection network. The PCU processors are microprogrammable
microprocessors that perform the actual SIMD and MIMD computations. The PCU
memory modules are used by the PCU processors for data storage in SIMD mode
and both data and instruction storage in MIMD mode. The interconnection
network provides a means of communication among the PCU processors and
memory modules. Two possible ways to organize the PCU and different types
of partitionable networks which can be used are described in section II.

The Micro Controllers (MCs) are a set of microprogrammable
microprocessors which act as the control units for the PCU processors in
SIMD mode and orchestrate the activities of the PCU processors in MIMD mode.
There are Q MCs. Each MC controls N/Q PCU processors. A virtual SIMD
machine of size RN/Q, where $R = 2^r$, $Q = 2^q$, and $0 \leq r \leq q$, is
obtained by loading R MC memory modules with the same instructions
simultaneously. Similarly, a virtual MIMD machine of size RN/Q is obtained
by combining the efforts of the PCU processors of R MCs. Possible values
for N and Q are 1024 and 16, respectively. Control Storage contains the
programs for the MCs. The MCs are discussed in more detail in section III.

The Memory Management System controls the loading and unloading of the
PCU memory modules. It employs a set of cooperating dedicated
microprocessors. The Memory Storage System stores these files. Multiple
devices are used to allow parallel data transfers. The secondary memory
system is described in section IV.

The System Control Unit is a conventional machine, such as a PDP-11,
and is responsible for the overall coordination of the activities of the
other components of PASM. Examples of the tasks the System Control Unit
will perform include program development, job scheduling for all of the PCU,
and coordination of the loading of the PCU memory modules from the Memory
Storage System with the loading of the MC memory modules from Control
Storage. By carefully choosing which tasks should be assigned to the System
Control Unit and which should be assigned to other system components (such
as the Memory Management System), the System Control Unit can work
effectively and not become a bottleneck.

Together, sections II, III, and IV present the overall architecture of
PASM. Particular attention is paid to the ways in which the control
structure, interprocessor communications, and secondary memory scheme allow
PASM to be efficiently partitioned into independent virtual machines.
Variations in the design of PASM's PCU which still support these control,
communications, and secondary memory ideas are examined. This examination
demonstrates how the concepts underlying PASM can be used in the design of
different systems.

In section V, image processing algorithms using PASM are presented. In
particular, smoothing, histogram calculations, and the two-dimensional FFT
are examined. Using these examples, the potential improvement a system such
as PASM can provide over serial machines is demonstrated.


II. PARALLEL COMPUTATION UNIT


A. PCU Organization

The Parallel Computation Unit (PCU) contains processors, memories, and
an interconnection network. One configuration of these components is to
connect a memory module to each processor to form a processor-memory pair
called a processing element (PE). The interconnection network is used for
communications between PEs. This configuration is called PE-to-PE and is
shown in Figure 2. A pair of memory units is used for each memory module.
This double-buffering scheme allows data to be moved between one memory unit
and secondary storage (the Memory Storage System) while the processor
operates on data in the other memory unit. In the PE-to-PE configuration,
memory references are relatively fast; however, the transfer of large blocks
of data from processor to processor is delayed by the memory fetching and
storing which must be done.

The P-to-M (processor-to-memory) configuration, shown in Figure 3, uses
the interconnection network to connect the processors to the memory modules.
Again, double-buffering is employed. With the P-to-M structure, every
memory reference must travel through the interconnection network. To fetch
an operand from memory, the processor must first send the address of the
operand through the interconnection network to a memory. Then, the
processor receives the operand from the memory via the interconnection
network. Advantages of the P-to-M configuration are that a memory connected
to a processor can be reconnected to another processor, effectively
transferring the entire contents of the memory from one processor to
another, and that the number of memories does not have to be equal to the
number of processors (e.g. BSP [9]). A disadvantage is that all memory
references must go through the interconnection network.


A more detailed analysis reveals some of the tradeoffs involved in
these two configurations. If $T_{mr}$ is the time required for a memory
access (either a read or a write), and $T_{in}$ is the time to send a data
item through the interconnection network, then the time required for a
memory reference in the P-to-M configuration, $T_{P-M}$, is given by

(1)  $T_{P-M} = T_{in} + T_{mr} + T_{in}$.

$T_{in}$ must be included twice, once for the processor to send the address
to the memory and once for transferring the data. (The time required for
controlling the network is omitted since control methods vary.)

For the PE-to-PE configuration the time required for a memory
reference, $T_{PE}$, depends on the location of the memory which is to be
used. If the memory is local, then

(2)  $T_{PE} = T_{mr}$.

If the memory is connected to some other processor, then

(3)  $T_{PE} = T_{in} + T'_{mr} + T_{in}$.

Again $T_{in}$ is included twice. In this case, the first $T_{in}$
represents the time required to transfer the address from the PE requesting
the data item to the PE which has the data item. $T'_{mr}$ represents the
time required for the PE which has the data item to recognize and service
the data request. This may require a significantly longer delay than
$T_{mr}$. The second $T_{in}$ is the time required to transfer the data
item. If p is the probability of a local memory reference, then (2) and (3)
can be combined to give the expected memory reference time

(4)  $E[T_{PE}] = p T_{mr} + (1-p)(T_{in} + T'_{mr} + T_{in})$.

Comparing (1) and (4),

(5)  $T_{P-M} \geq E[T_{PE}]$

for p sufficiently large. Thus, the "best" configuration is task dependent.
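
As a rough numerical illustration of equations (1) through (5), the
following Python sketch evaluates both reference times and the smallest
probability of a local reference for which inequality (5) holds. The
function names and the sample timings are assumptions made for illustration
only; they are not measurements of PASM.

```python
def t_p_to_m(t_in, t_mr):
    """Equation (1): address out, memory access, operand back."""
    return t_in + t_mr + t_in


def expected_t_pe(p, t_in, t_mr, t_mr_remote):
    """Equation (4): p is the probability that a reference is local.

    t_mr_remote is the (possibly longer) time for the remote PE to
    recognize and service the request.
    """
    return p * t_mr + (1 - p) * (t_in + t_mr_remote + t_in)


def break_even_p(t_in, t_mr, t_mr_remote):
    """Smallest p for which inequality (5) holds.

    Assumes t_in > 0 and t_mr_remote >= t_mr.
    """
    remote = t_in + t_mr_remote + t_in
    return (remote - t_p_to_m(t_in, t_mr)) / (remote - t_mr)


# Hypothetical timings in arbitrary units.
T_IN, T_MR, T_MR_REMOTE = 1, 1, 3
threshold = break_even_p(T_IN, T_MR, T_MR_REMOTE)
```

With these sample values the PE-to-PE organization is at least as fast as
P-to-M whenever half or more of the references are local.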


When operating in SIMD mode with the PE-to-PE configuration, it is
often possible to omit one occurrence of $T_{in}$ in (3) and reduce
$T'_{mr}$ to $T_{mr}$. This is done by computing the address of the desired
data in the processor connected to the memory to be accessed. Thus, (4)
reduces to

(6)  $E[T_{PE}] = p T_{mr} + (1-p)(T_{mr} + T_{in})$.

Therefore, when operating in SIMD mode the PE-to-PE configuration is
preferable.
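
The conclusion follows because the right-hand side of (6) never exceeds
T_mr + T_in, which is less than the P-to-M reference time T_in + T_mr + T_in
whenever T_in > 0. A small Python check over hypothetical timings (the
function names and values are illustrative assumptions) makes this concrete:

```python
def t_p_to_m(t_in, t_mr):
    """P-to-M reference time: T_in + T_mr + T_in."""
    return t_in + t_mr + t_in


def expected_t_pe_simd(p, t_in, t_mr):
    """Equation (6): PE-to-PE expected reference time in SIMD mode."""
    return p * t_mr + (1 - p) * (t_mr + t_in)


def pe_to_pe_never_slower(t_in, t_mr, steps=10):
    """Check that (6) <= P-to-M time for p = 0, 1/steps, ..., 1."""
    return all(
        expected_t_pe_simd(k / steps, t_in, t_mr) <= t_p_to_m(t_in, t_mr)
        for k in range(steps + 1)
    )
```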


When operating in MIMD mode, the PE-to-PE configuration requires that
two processors be involved in every non-local memory reference. The efforts
of two processors involved in a data transfer can be coordinated by having
the processor which initiates the transfer interrupt the other processor or
by dedicating one of these processors to handling data transfers. In the
P-to-M configuration, the memories are shared by the processors, i.e., more
than one processor can access the same memory for either data or
instructions. However, for the image processing tasks that have been
examined, most data and instructions can be stored in the local memory,
reducing the impact of this consideration.

The PE-to-PE configuration will be used in PASM. Depending on the
application for which a different partitionable SIMD/MIMD system is
intended, the P-to-M configuration may be preferable. The interconnection
networks, control structure, and secondary memory system described below can
be used in conjunction with either.


B. Interconnection Networks

One of the major problems in designing parallel processing systems is
the construction of an interconnection network to provide interprocessor
communications. In this section a variety of recirculating (single stage)
and multistage networks are described. Each of the networks discussed is
compatible with the control and memory management schemes presented later.
Thus, depending on the intended applications of a particular partitionable
SIMD/MIMD machine being designed, any one of these networks can be used
within the overall system structure described in this paper.

Formally, an interconnection network is a set of interconnection
functions [36]. Each interconnection function is a bijection (permutation)
on the set of N input/output addresses, the integers from 0 to N-1. The
interconnection function f connects input i to output f(i), $0 \leq i < N$.
Several types of interconnection networks are discussed below, using
$p_{n-1} \ldots p_1 p_0$ to denote the binary representation of an arbitrary
input/output address, $\bar{p}_i$ to denote the complement of $p_i$, and
$n = \log_2 N$.

The Cube network consists of the n functions defined by:

    $cube_i(p_{n-1} \ldots p_{i+1} p_i p_{i-1} \ldots p_0) = p_{n-1} \ldots p_{i+1} \bar{p}_i p_{i-1} \ldots p_0$

for $0 \leq i < n$ [36]. Typically, a multistage Cube network consists of n
stages of switches, where stage i implements the $cube_i$ interconnection
function. The topology of the indirect binary n-cube network [29] is shown
in Figure 4. The switches, called interchange boxes, can be in either the
straight or exchange state as shown. Each interchange box is independently
controlled. Information on the capabilities of this network can be found in
[29]. The STARAN flip network [4,5] is also based on the Cube
interconnection functions and its capabilities are a subset of those of the
binary n-cube [50].
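
The Cube functions amount to complementing a single address bit. A minimal
Python sketch, with illustrative function names that are not taken from the
paper, shows the function and checks that it is a permutation of the
addresses:

```python
def cube(i, address):
    """cube_i: complement bit i of the address, leave the others alone."""
    return address ^ (1 << i)


def is_permutation(func, n_addresses):
    """True if func maps 0..n_addresses-1 onto itself one-to-one."""
    images = sorted(func(a) for a in range(n_addresses))
    return images == list(range(n_addresses))


N = 8  # n = 3 address bits
exchanged = [cube(0, a) for a in range(N)]  # what stage 0 does in "exchange"
```

Applying the same Cube function twice returns every address to itself,
which is why an interchange box needs only the straight and exchange states
to realize it.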

The Omega network [20], shown in Figure 5, is a multistage
implementation of the "Shuffle-Exchange" [54]. In this network, an
interchange box can be in either the straight, exchange, lower broadcast, or
upper broadcast state as shown. Each interchange box is independently
controlled. As in the indirect binary n-cube, stage i of the Omega network
can implement $cube_i$ [40,50]. The order of the stages of these two
networks is reversed, but for the following discussion on partitioning these
differences are irrelevant, and both will be referred to as multistage Cube
networks.

Consider partitioning a Cube network into $2^{n-r}$ subnetworks, each of
which has $2^r$ inputs and outputs. This can be accomplished by grouping
all of the input and output lines which agree in n-r of their address bits.
For example, consider grouping together all input and output lines which
have the same n-r most significant address bits into a partition. To
implement this, force the interchange boxes of stages r through n-1 to be in
the straight state. This prevents data transfers between processors whose
addresses differ in the n-r most significant bits, thus preventing any
communication between partitions [50]. Each subnetwork has the same
properties as the whole network, so any subnetwork can be further subdivided
into smaller partitions. Thus, varying sizes of partitions which are powers
of two can be formed [53]. For example, for N = 16, one partitioning of the
network is into subnetworks of sizes 8, 4, 2, and 2.
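
One way to picture such a partitioning is sketched below in Python for a
16-port network split into subnetworks of sizes 8, 4, 2, and 2. Each
subnetwork is identified by a fixed high-order address prefix; the
particular prefixes and the helper names are assumptions chosen for
illustration, not a layout prescribed by the paper.

```python
def subnetwork(prefix, prefix_len, n):
    """Addresses whose prefix_len most significant bits equal prefix."""
    free_bits = n - prefix_len
    return [(prefix << free_bits) | low for low in range(1 << free_bits)]


def free_stages(prefix_len, n):
    """Stages still usable; the higher stages are forced straight."""
    return range(n - prefix_len)


def is_closed(members, stages):
    """True if no allowed cube_i can carry data out of the subnetwork."""
    inside = set(members)
    return all((a ^ (1 << i)) in inside for a in inside for i in stages)


n = 4  # N = 16
layout = [(0b0, 1), (0b10, 2), (0b110, 3), (0b111, 3)]  # (prefix, length)
parts = [subnetwork(p, k, n) for p, k in layout]
sizes = [len(members) for members in parts]
```

Because each subnetwork only ever uses the Cube functions on its own free
low-order bits, no transfer can cross from one partition into another.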

A recirculating (single stage) network based on the Cube
interconnection network can be modeled using the input and output selectors
in Figure 6 [40]. Conceptually, for each x, $0 \leq x < N$, input selector
x is connected to output selector $cube_i(x)$, $0 \leq i < n$, and output
selector x is connected to input selector $cube_i(x)$, $0 \leq i < n$. Each
pass through the network allows the implementation of one Cube function. In
this implementation, partitioning is achieved by preventing the input
selectors from using all of their output lines, and by preventing the output
selectors from using all of their input lines. That is, if, for example,
all of the addresses of the processors in a partition agree in the n-r most
significant bits, then these processors should be prevented from using the
input and output lines which implement the $cube_{n-1}$ through $cube_r$
interconnection functions.

The Plus-Minus $2^i$ (PM2I) network consists of the 2n-1 interconnection
functions defined by:

    $PM2_{+i}(j) = j + 2^i$ modulo N
    $PM2_{-i}(j) = j - 2^i$ modulo N

for $0 \leq i < n$, $0 \leq j < N$ [36]. Note that
$PM2_{+(n-1)} = PM2_{-(n-1)}$. Feng's data manipulator network [11] consists
of n stages of connections, where stage i implements the two functions
$PM2_{+i}$ and $PM2_{-i}$ as shown in Figure 7. Each stage is controlled by
only a pair of signals. An augmented data manipulator (ADM) network is a
data manipulator with individual switch control, i.e., each switch can get
any of the signals H (straight), U ($PM2_{-i}$), and D ($PM2_{+i}$). The
capabilities of the ADM are a superset of those of the Omega network
[40,50].
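
The PM2I functions are modular additions and subtractions of powers of two.
The Python sketch below (function names are illustrative assumptions)
implements both families and counts how many distinct functions they yield,
which reflects the coincidence of the plus and minus functions for
i = n-1:

```python
def pm2_plus(i, j, n):
    """PM2_{+i}(j) = j + 2^i modulo N, with N = 2^n."""
    return (j + (1 << i)) % (1 << n)


def pm2_minus(i, j, n):
    """PM2_{-i}(j) = j - 2^i modulo N, with N = 2^n."""
    return (j - (1 << i)) % (1 << n)


def distinct_pm2i_functions(n):
    """Count distinct mappings among PM2_{+i} and PM2_{-i}, 0 <= i < n."""
    size = 1 << n
    tables = set()
    for i in range(n):
        tables.add(tuple(pm2_plus(i, j, n) for j in range(size)))
        tables.add(tuple(pm2_minus(i, j, n) for j in range(size)))
    return len(tables)
```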

Like the Cube network, the ADM network can be partitioned into
subnetworks of varying sizes which are powers of two [53]. In this case, to
form $2^{n-r}$ independent subnetworks of size $2^r$, the input and output
lines which have the same n-r least significant address bits are grouped
into a partition. This is implemented by forcing the switches of stages 0
through n-r-1 into the H state. Since each subnetwork has the properties of
the whole network, any subnetwork can be further subdivided into smaller
partitions.
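
To see why grouping by low-order bits works for PM2I-based networks, the
Python sketch below collects every address reachable from a starting
address when only the PM2I functions with i >= n-r remain available, which
is the restriction that leaves the n-r least significant bits unchanged.
The helper name and the example sizes are assumptions for illustration.

```python
def pm2i_partition(address, n, r):
    """Addresses reachable using only PM2_{+i}, PM2_{-i} with i >= n - r."""
    size = 1 << n
    members = {address}
    frontier = [address]
    while frontier:
        a = frontier.pop()
        for i in range(n - r, n):
            for b in ((a + (1 << i)) % size, (a - (1 << i)) % size):
                if b not in members:
                    members.add(b)
                    frontier.append(b)
    return sorted(members)


# N = 16 (n = 4), subnetworks of size 2^r = 4.
group = pm2i_partition(3, 4, 2)
```

Adding or subtracting $2^i$ with i >= n-r cannot alter the low-order n-r
bits, so every reachable address shares them with the starting address.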

A recirculating network based on the PM2I functions can be modeled
using input and output selectors shown in Figure 6. In this case, for each
x, $0 \leq x < N$, input selector x is connected to output selectors
$PM2_{+i}(x)$ and $PM2_{-i}(x)$, $0 \leq i < n$, and output selector x is
connected to input selectors $PM2_{-i}(x)$ and $PM2_{+i}(x)$,
$0 \leq i < n$. This single stage network can be partitioned in a manner
similar to the multistage ADM network [53].

The multistage networks described above can be constructed in two ways:
as combinational logic or as a pipeline [53]. In the pipeline construction,
registers are placed between stages, increasing the speed of block
transfers. In the following sections, it will be assumed that the
processors will be partitioned such that their addresses agree in the
low-order bit positions, so that either PM2I or Cube type networks can be
used. Thus, any of the partitionable networks discussed above can be used
in the PCU. For PASM, the multistage pipelined ADM network is being
considered because of its speed and flexibility. However, a multistage
pipelined Cube network may be sufficient for the system's needs. The
tradeoffs are currently under investigation. More information about these
networks can be found in [2S,36,39,41,42,50,51,53-55].

III.  MICRO  CONTROLLERS 

A. Addressing Conventions

In order to have a partitionable system, some form of multiple control
units must be provided. In PASM, this is done by having $Q = 2^q$ Micro
Controllers (MCs), physically addressed (numbered) from 0 to Q-1. Each MC
controls N/Q PCU processors, as shown in Figure 8.

An MC is a microprogrammable microprocessor which is attached to a
memory module. Each memory module consists of a pair of memory units, "A"
and "B," so that memory loading and computations can be overlapped. In SIMD
mode, each MC fetches instructions from its memory module, executing the
control flow instructions (e.g. branches) and broadcasting the data
processing instructions to its PCU processors. The physical addresses of
the N/Q processors which are connected to an MC must all have the same
low-order q bits so that the network can be partitioned. The value of these
low-order q bits is the physical address of the MC. A virtual SIMD machine
of size RN/Q, where $R = 2^r$ and $1 \leq R \leq Q$, is obtained by loading
R MCs with the same instructions and synchronizing the MCs. The physical
addresses of these MCs must have the same low-order q-r bits so that all of
the PCU processors in the partition have the same low-order q-r physical
address bits. Similarly, a virtual MIMD machine of size RN/Q is obtained by
combining the efforts of the PCU processors associated with R MCs which have
the same low-order q-r physical address bits. In MIMD mode, the MCs may be
used to help coordinate the activities of their PCU processors.

In each partition the PCU processors and memory modules are assigned
logical addresses. Given a virtual machine of size RN/Q, the processors and
memory modules for this partition have logical addresses (numbers) 0 to
(RN/Q) - 1, $R = 2^r$, $0 \leq r \leq q$. Assuming that the MCs have been
assigned as described above, then the logical number of a processor or
memory module is the high-order r+n-q bits of the physical number. Recall
that all of the physical addresses of the processors in a partition must
have the same q-r low-order bits. For example, for N = 1024, Q = 16, and
R = 4, one allowable choice of processors to form a partition of size RN/Q
is those whose physical addresses are 3, 7, 11, 15, ..., 1023. The
high-order r+n-q = 8 bits of these 10-bit physical addresses are 0, 1, 2,
3, ..., 255, respectively. The value of the low-order q-r = 2 bits of all
of these physical processor addresses is equal to three.
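
The processor example can be reproduced with a few lines of Python. The
function names are illustrative assumptions; the arithmetic simply drops
the q-r low-order bits that all members of the partition share.

```python
def partition_processors(low_bits, n, q, r):
    """Physical addresses whose q - r low-order bits equal low_bits."""
    step = 1 << (q - r)
    return list(range(low_bits, 1 << n, step))


def logical_address(physical, q, r):
    """High-order r + n - q bits of an n-bit physical address."""
    return physical >> (q - r)


# N = 1024 (n = 10), Q = 16 (q = 4), R = 4 (r = 2), shared low bits = 3.
physical = partition_processors(3, 10, 4, 2)
logical = [logical_address(a, 4, 2) for a in physical]
```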

Similarly, the MCs assigned to the partition are logically numbered
(addressed) from 0 to R-1. For R > 1, the logical number of an MC is the
high-order r bits of its physical number. Recall that all of the physical
addresses of the MCs in a partition must agree in the low-order q-r bits.
For R = 1, there is only one MC and it is considered logical number 0. For
example, if N = 1024, Q = 16, and R = 4, one allowable choice of four MCs is
those whose physical addresses are 3, 7, 11, and 15. The high-order r = 2
bits of these four-bit physical addresses are 0, 1, 2, and 3, respectively.
The value of the low-order q-r = 2 bits of all the physical MC addresses is
equal to three.
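
The relationship between MC addresses and the processors they control can
be sketched the same way. In the Python fragment below, the helper names
are assumptions; the rule that an MC controls the processors whose
low-order q bits equal its own physical address is the one stated in the
text.

```python
def processors_of_mc(mc, n, q):
    """Physical addresses of the N/Q processors controlled by MC mc."""
    return list(range(mc, 1 << n, 1 << q))


def partition_mcs(low_bits, q, r):
    """Physical MC addresses whose q - r low-order bits equal low_bits."""
    return list(range(low_bits, 1 << q, 1 << (q - r)))


def logical_mc(physical_mc, q, r):
    """High-order r bits of a q-bit physical MC address."""
    return physical_mc >> (q - r)


# N = 1024 (n = 10), Q = 16 (q = 4), R = 4 (r = 2), shared low bits = 3.
mcs = partition_mcs(3, 4, 2)
members = sorted(p for mc in mcs for p in processors_of_mc(mc, 10, 4))
```

The four MCs together control exactly the 256 processors whose physical
addresses end in the same two low-order bits.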

The PASM language compilers and operating system are used to convert
from logical to physical addresses. Thus, a system user deals only with
logical addresses.

B. Communications with the System Control Unit

When large SIMD jobs are run, that is, jobs which require more than N/Q
processors, more than one MC executes the same set of instructions. Since
each MC has its own memory, if more than one MC is to be used, then several
memories must be loaded with the same set of instructions. Another occasion
which requires several MC memories to contain the same instructions is when
the same program is to be run for several different sets of data. For
example, suppose Q different images are to be processed using a program that
requires N/Q processors for each image. Each MC will be executing the same
program, but each will be working on a different image.

The fastest way to load several MC memories with the same set of
instructions is to load all of the memories at the same time. This can be
accomplished by connecting the Control Storage to all of the MC memory
modules via a bus as shown in Figure 9. Each memory module is either
enabled or disabled for loading from Control Storage, depending on the
contents of a Q-bit mask register called the Micro Controller Memory Load
(MCML) register. MC memory module i is enabled if the ith bit of the MCML
register is a "1," otherwise it is disabled. The Q-bit Micro Controller
Memory Select (MCMS) register selects which memory unit an MC processor
should use for instructions. A "1" in the ith bit means MC processor i is
to use its A memory unit, while a "0" in the ith bit means it is to use its
B memory unit. An enabled memory unit not being used by an MC processor
receives the data from Control Storage.
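
The interplay of the two registers can be summarized in a short Python
sketch. It assumes, purely for illustration, that bit i of a register is
the bit of weight 2^i; the paper does not state the bit ordering, and the
function name is hypothetical.

```python
def units_receiving_load(mcml, mcms, num_mcs):
    """Map each enabled MC to the memory unit ('A' or 'B') that is loaded.

    Assumption: bit i of a register is the bit with weight 2**i.
    """
    loaded = {}
    for i in range(num_mcs):
        if (mcml >> i) & 1:  # memory module i is enabled for loading
            in_use = "A" if (mcms >> i) & 1 else "B"
            # The unit the MC is not executing from receives the data.
            loaded[i] = "B" if in_use == "A" else "A"
    return loaded


# Q = 4: modules 0 and 2 enabled; MC 0 runs from A, MC 2 runs from B.
result = units_receiving_load(0b0101, 0b0001, 4)
```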

The Q-bit Micro Controller Status (MCS) register contains the go/done
status of the MCs. A "0" in the ith bit means that MC i is done. When the
ith bit is set to a "1," the MC sets its program counter to zero and begins
executing or broadcasting the contents of the memory unit that is specified
by the MCMS register. When the MC is done, it sets its bit of the MCS
register to "0" and sends an interrupt to the System Control Unit to inform
it that an MC has finished.
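
The interplay of the three Q-bit registers can be sketched in a few lines of
Python. This is only a behavioral model under stated assumptions: the class
name, the method names, and the representation of each memory module as a
dictionary with "A" and "B" units are hypothetical and are not taken from the
PASM design.

```python
class MCControlRegisters:
    """Toy model of the Q-bit MCML, MCMS and MCS registers (names assumed)."""

    def __init__(self, q_count):
        self.q_count = q_count
        self.mcml = 0  # bit i = 1: MC i's memory module is enabled for loading
        self.mcms = 0  # bit i = 1: MC i uses memory unit A, 0: unit B
        self.mcs = 0   # bit i = 1: MC i is running ("go"), 0: done
        # Each MC memory module is modeled as two units, A and B.
        self.memory = [{"A": None, "B": None} for _ in range(q_count)]

    def active_unit(self, i):
        """Unit that MC processor i currently fetches instructions from."""
        return "A" if (self.mcms >> i) & 1 else "B"

    def load_from_control_storage(self, program):
        """Broadcast `program` on the bus. Every enabled module stores it in
        the unit its MC processor is not using. Returns the MCs loaded."""
        loaded = []
        for i in range(self.q_count):
            if (self.mcml >> i) & 1:
                idle = "B" if self.active_unit(i) == "A" else "A"
                self.memory[i][idle] = program
                loaded.append(i)
        return loaded

    def start(self, i):
        """Set bit i of MCS: MC i begins executing at program counter zero."""
        self.mcs |= 1 << i

    def finish(self, i):
        """Clear bit i of MCS and report an interrupt for the System Control Unit."""
        self.mcs &= ~(1 << i)
        return ("interrupt", i)
```

With Q = 4, setting MCML to 0101 loads MCs 0 and 2 in a single bus transfer,
each into whichever unit its processor is not currently using.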

C.  Communications Among Micro Controllers

Instructions which examine the collective status of all of the PEs of a
virtual SIMD machine include "if any," "if all," and "if none." These
instructions change the flow of control of the program at execution time
depending on whether or not any or all processors in the SIMD machine satisfy
some condition. For example, if each PE is processing data from a different
radar unit, but all PEs are looking for enemy planes, it is desirable for
the control unit to know "if any" of the PEs has discovered a possible
attack.

The task of computing an "if any" type of statement that involves
several MCs can be handled using the existing hardware. This can be
accomplished by having the PCU processors associated with these MCs use a
recursive doubling type of algorithm.
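
A minimal sketch of such a recursive doubling computation follows. The text
names only the technique; the exchange pattern used here (PE i combines with
PE i XOR 2^s at step s) and the function name are assumptions made for
illustration, and each synchronous inter-PE transfer is modeled as a list
comprehension.

```python
def if_any_recursive_doubling(flags):
    """flags[i] is PE i's local condition; the PE count must be a power of two.

    Returns (values, steps): after log2(P) exchange steps every PE holds the
    logical OR of all P local conditions.
    """
    p = len(flags)
    assert p > 0 and p & (p - 1) == 0, "number of PEs must be a power of two"
    values = list(flags)
    step, steps = 1, 0
    while step < p:
        # All PEs exchange simultaneously: PE i ORs in the value of PE i XOR step.
        values = [values[i] or values[i ^ step] for i in range(p)]
        step *= 2
        steps += 1
    return values, steps
```

For P processors the result is available everywhere after log2(P) transfers,
which is the cost the specialized bus described next is meant to avoid.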

Specialized hardware to handle this task can be added inexpensively and
will result in a faster solution than the software approach. This approach
uses a bus which connects the MCs and requires that each MC have access to
the job identification number (ID) for the job it is running. The ID for a
job on an MC may range from 0 to 2Q-1 since there can be at most 2Q jobs,
i.e., one in each of the 2Q memory units. The Micro Controller
Communication Bus (MCCB) consists of an ID bus and a data bus. When an "if
any" type instruction is encountered, each MC associated with the job sends
a request to the bus controller to use the communication bus. When one of
the MCs becomes the first item in the queue, the bus controller sends that MC
a "permission to use the bus" signal so the MC may broadcast its job ID to
all of the MCs (including itself) via the (q+1)-bit MCCB ID bus. If an MC is
running the job with the ID which is on the ID bus, it then puts its local
results onto the MCCB data bus. The MCCB data bus is one bit wide and will
be constructed using "wired and" technology, i.e., the bus is a Q-input
"wired and" gate. This allows all of the MCs associated with a job to put
their data on the bus simultaneously. Then, while all this information is
on the bus, all of the MCs associated with the job read the data and take
the appropriate action. Each MC serviced removes itself from the MCCB
queue.

For example, in the case of the "if any" instruction, when the job ID
appears on the ID bus each MC puts its local results on the data bus. A "1"
is sent if none of its PCU processors met the condition; a "0" is sent if
any of its PCU processors met the condition. An MC which does not match the
job ID will present a "1" to the data bus. If any PCU processor running the
job meets the condition the data bus will be "0"; however, if no PCU
processor meets the condition the bus will be "1." All of the MCs will then
have access to this information, which will be needed to execute the
conditional branch in the common instruction stream.
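
The "wired and" evaluation of "if any" can be illustrated with a short
simulation. The function name and the representation of each MC as a
(job ID, condition met) pair are hypothetical; only the drive values and the
AND behavior of the bus come from the description above.

```python
def mccb_if_any(bus_job_id, mcs):
    """Simulate one MCCB data-bus cycle for an "if any" instruction.

    mcs: one (job_id, any_pe_met_condition) pair per MC.
    Returns (bus_bit, any_met), where bus_bit is the wired-AND of all drives.
    """
    bus = 1
    for job_id, any_met in mcs:
        if job_id == bus_job_id:
            drive = 0 if any_met else 1  # "0" if any local PE met the condition
        else:
            drive = 1                    # MCs running other jobs present a "1"
        bus &= drive                     # wired-AND: any "0" pulls the bus to "0"
    return bus, bus == 0
```

Because MCs running other jobs always present a "1," they never disturb the
result seen by the MCs of the job whose ID is on the ID bus.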


The hardware required to interface each MC to the MCCB is shown in
Figure 10. The job ID is transmitted from a single MC to the MCCB ID bus by
the tristate buffer only when that MC is given permission to use the MCCB.
Then, each MC uses its comparator to compare its job ID to the MCCB ID bus.

A major problem associated with any bus in a multiprocessing system is
contention. The MCCB is allocated on a priority basis. A priority ring
based on the physical addresses of the MCs is used. MC (i+1) modulo Q has a
higher priority than MC i, 0 ≤ i < Q. The priority ring is broken by a
Q-bit shift register which contains exactly one "1." A "1" in the ith bit of
the shift register indicates that MC (i-1) modulo Q has the highest priority,
and therefore MC i has the lowest priority. After each bus cycle the shift
register is circularly shifted one bit, such that if MC i had the lowest
priority it is given the highest priority for the next bus cycle. Details
and a simulation are in [?7].
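
The rotating priority rule can be sketched as follows. The shift register is
modeled by the index of its single "1" (called marker here), and the queue by
a Q-bit integer; both names and the function signatures are assumptions for
illustration, not the controller logic of the cited simulation.

```python
def grant(queue_bits, marker, q_count):
    """Pick the waiting MC with the highest priority.

    queue_bits: Q-bit integer, bit i set if MC i is waiting for the bus.
    marker: position of the single "1" in the shift register; MC marker has
            the lowest priority and MC (marker - 1) mod Q the highest.
    Returns the index of the MC granted the bus, or None if nobody waits.
    """
    for k in range(1, q_count + 1):
        candidate = (marker - k) % q_count  # scan from highest to lowest priority
        if (queue_bits >> candidate) & 1:
            return candidate
    return None


def rotate(marker, q_count):
    """Circular shift after a bus cycle: the old lowest-priority MC becomes highest."""
    return (marker + 1) % q_count
```

With Q = 4, marker 0, and MCs 0 and 2 waiting, MC 2 is granted the bus; after
one rotation MC 0 holds the highest priority.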

The queue for the communication bus simply consists of a Q-bit register
which is loaded at the end of each bus cycle. A "1" in the ith bit of the
register indicates that MC i is presently waiting to use the communication
bus. After an MC has finished using the bus it resets its corresponding bit
to zero, thus removing itself from the queue. A block diagram of the MCCB
controller is shown in Figure 11.

Thus, this relatively inexpensive hardware design will allow rapid
execution of the "if any," "if all," "if none," etc., instructions without
burdening the System Control Unit. In addition to their use in SIMD programs,
these instructions can be used to aid in synchronizing and coordinating PCU
processors in a virtual MIMD machine.

D.  Communications with the PCU Processors

The processors used in the PCU are to be constructed from user
microprogrammable "bit-slice" components, available from manufacturers such
as Texas Instruments [65], Advanced Micro Devices [2], and Intel [17].
Bit-slice components are designed such that several sets of these components
can be combined to form processors of arbitrary word length. Bit-slice
processor components typically include a computational unit, a sequencer, and
special hardware to allow such features as pipelining and lookahead addition.
The computational unit contains the processor's registers, the mechanism for
register transfers, and the mechanism for arithmetic computation. The
sequencer controls microprogram execution by calculating the next execution
address. The choice of bit-slice processors over single chip
microprocessors at this point in time is due for the most part to the
advantages of speed and versatility seen in being able to have a 16-bit user
microprogrammable processor. By having user microprogrammable processors, the
instruction set can be optimized for parallel image processing.

The unique structure of PASM dictates that the processors used in the
PCU be unique themselves. Since bit-slice processors are microprogrammable
by the user they can be made to function in some special ways. The term
micro-function refers to the set of tasks carried out by applying one
control word to the computational unit. Typical bit-slice processors, such as
Texas Instruments' SN74S481 or Intel's 3000, offer a set of commonly used
micro-functions as the only micro-functions available to the user.
Micro-function instruction words can be represented by fewer bits than the
actual control word if the set of available micro-functions is limited. For
example, the Texas Instruments SN74S481 has an 11-bit function word and a
24-bit control word. The micro-functions are translated into control words
by a programmed logic array. The microprogram of the processor can then be
stored more efficiently, the number of interchip connections is kept at a
minimum, and the job of writing microcode can be compared with that of
writing macrocode. The size of the microprogram store and the number of
interchip connections become especially important when considering the
construction of a large array of processors. The cost of limiting the set of
available micro-functions is that of limiting overall versatility. However,
most bit-slice processors offer a set of micro-functions complete enough for
almost all practical purposes. If the micro-functions provided are not
sufficient for PASM, then a processor which allows a customized encoding of
control words into an allowable set of micro-function instruction words may
be used.

When PASM is operating in SIMD mode, many processors are executing the
same instruction stream in a synchronous fashion. If the MCs handle
instruction decoding and sequencing for the sets of PCU processors under
their control, as opposed to letting each PCU processor handle its own
instruction decoding and sequencing, the number of duplicate control memory
stores and sequencing hardware chips is reduced by a factor of N/Q.

For a set of PCU processors to operate in MIMD mode, the above scheme
is not sufficient. The reason for this stems from the fact that each PCU
processor in this set may execute a different instruction at the same time.
Thus, the above scheme must be modified so that a subset of the PCU
processors is capable of handling its own instruction decoding and sequencing
while in MIMD mode and of allowing MCs to do these tasks while in SIMD mode.
The size of such a set of processors would be dictated by the intended
application of the system in question. Initial studies for PASM indicate that
SIMD mode will be used much more often than MIMD mode and that MIMD tasks
will, in general, require fewer processors. It may, therefore, be
reasonable to expect that the size of the set of "dual mode" processors could
be much smaller than the overall size of the PCU without significantly
hindering computational capability. Furthermore, since the processors are
microprogrammable it is possible for "dual mode" processors to have two
instruction sets, one for MIMD mode and one for SIMD mode. The instruction
sets are discussed further in [46].

E.  Enabling and Disabling PCU Processors

In SIMD mode, all of the active PCU processors will execute
instructions broadcast to them by their MC. A masking scheme is a method for
determining which PCU processors will be active at a given point in time.
An SIMD machine may have several different masking schemes. The masking
scheme provides the system user with a device to enable some processors and
disable others.

The general masking scheme uses an N-bit vector to determine which PCU
processors to activate. Processor number i will be active if the ith bit of
the vector is a 1, for 0 ≤ i < N (where the low order bit is the 0th). For
example, if N = 8 and the bit vector is 00101011, then only processors 0, 1,
3, and 5 will be active. Obviously, a general mask can activate any set of
processors. These masks are specified in the SIMD program, and are part of
the instruction stream broadcast by the control unit. A mask instruction is
executed whenever a change in the active status of the processors is
required. The Illiac IV, which has 64 processors and 64-bit words, uses
general masks [53]. However, when N is larger, say 1024, a scheme such as
this, where the mask size is N bits, becomes less appealing in terms of the
difficulty in constructing and storing each 1024-bit mask.
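
As a small illustration of the general mask convention (low-order bit is bit
0), the following sketch lists the processors a bit vector activates. The
function name and the use of a string written high-order bit first are
assumptions made for readability.

```python
def active_processors(mask_bits):
    """Return the processor numbers enabled by a general mask.

    mask_bits: the mask as a string written high-order bit first, as in the
    text, so the rightmost character is bit 0.
    """
    n = len(mask_bits)
    return [i for i in range(n) if mask_bits[n - 1 - i] == "1"]
```

For the 8-bit example above, active_processors("00101011") yields processors
0, 1, 3, and 5. For N = 1024, each such mask occupies 1024 bits.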

The PE address masking scheme, introduced in [36], uses an n-position
mask to specify which of the N PCU processors are to be activated. Each
position of the mask corresponds to a bit position in the logical addresses
of the processors. Each position of the mask will contain either a 0, 1, or X
("don't care") and the only processors that will be active are those whose
address matches the mask: 0 matches 0, 1 matches 1, and either 0 or 1
matches X. Superscripts are repetition factors; square brackets denote a
mask. For example, [X^(n-1) 0] activates all even numbered processors.

A negative PE address mask is similar to a regular PE address mask,
except that it activates all those processors which do not match the mask.
Negative PE address masks are prefixed with a minus sign to distinguish them
from regular PE address masks. This type of mask, introduced in [37], can
activate sets of processors a single regular PE address mask cannot, e.g.,
[-0^n] activates all processors except for number 0.
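
Regular and negative PE address masks can be modeled in a few lines. The
string representation of a mask (symbols "0", "1", and "X", high-order
position first) and the function names are assumptions made for this sketch.

```python
def matches(mask, address, n):
    """True if the n-bit address matches the mask: 0 matches 0, 1 matches 1,
    and X matches either bit value."""
    bits = format(address, "0{}b".format(n))
    return all(m == "X" or m == b for m, b in zip(mask, bits))


def activated(mask, n, negative=False):
    """Addresses enabled by a regular mask, or by a negative mask when
    negative=True (all processors that do NOT match)."""
    assert len(mask) == n
    return [a for a in range(2 ** n) if matches(mask, a, n) != negative]
```

With n = 3, the mask "XX0" activates processors 0, 2, 4, and 6, while the
negative mask formed from "000" activates every processor except number 0.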

The way in which PE address masks interact with various interconnection
networks is analyzed in [36]. In [4?] PE address masks are used to write
SIMD machine algorithms. Other properties of these masks are discussed in
[37,47,4?].

Like general masks, PE address masks are specified in the SIMD program.
PE address masks are more restrictive than general masks, in that a general
mask can activate any arbitrary set of processors and a PE address mask
cannot. However, for N >> 64, general masks are impractical in terms of
storage requirements and ease of programming, and so system architects must
consider alternatives. Together, regular and negative PE address masks
provide enough flexibility to allow the easy programming of a variety of
image processing tasks.


For ease of encoding and decoding, two bits are used to represent each
PE address mask position. Figure 12 shows the mask word format for each MC,
when N = 1024. The mask word consists of 2n+1 = 21 bits, which allows masks
having up to n = 10 positions and a sign bit to be specified. S is the sign
bit for the mask. Of the remaining 20 bits, the 2(n-q) = 12 high-order bits
pertain to the PCU processors in an MC group, while the low-order 2q = 8
bits pertain to the MC addresses. The entire mask word is a physical mask
and is used with the physical addresses of the PCU processors and the MCs.
A logical mask is a subset of the physical mask and consists of only that
part of the mask needed to control the processors in a partition. The Mask
Decoder of Figure 13 transforms the 21-bit mask word into a 64-bit general
mask vector, for N = 1024 and Q = 16. A Mask Decoder is part of each MC.
Each logical PE address mask is treated as the high-order portion of the
physical mask. The portion of the physical mask not considered part of the
logical mask is initialized with "X"s and left unaltered by program
execution. The sign bit of the logical and physical masks is the same. The
64-bit general mask vector generated is for the 64 PCU processors associated
with that MC. This is all that is required to translate a logical mask into
enable signals.
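
The decoding step performed in each MC can be sketched behaviorally. This is
not the gate-level design of the Mask Decoder figure: the two-bit encoding of
each position is not modeled, the mask is given as a string of "0", "1", and
"X" symbols (high-order position first), and the function name and signature
are assumptions. The sketch relies only on the stated split of the mask into
n-q high-order positions (processors within an MC group) and q low-order
positions (MC address).

```python
def decode_for_mc(mask, negative, mc_number, n, q):
    """Return the N/Q enable bits that MC `mc_number` derives from a physical mask.

    mask: n symbols from "0", "1", "X", high-order position first.
    negative: value of the sign bit (True for a negative mask).
    Entry i of the result is the enable bit of the ith processor of this MC.
    """
    high, low = mask[: n - q], mask[n - q:]
    mc_bits = format(mc_number, "0{}b".format(q))
    # Does this MC's own address match the low-order q mask positions?
    mc_match = all(m in ("X", b) for m, b in zip(low, mc_bits))
    vector = []
    for i in range(2 ** (n - q)):
        pe_bits = format(i, "0{}b".format(n - q))
        pe_match = all(m in ("X", b) for m, b in zip(high, pe_bits))
        # The sign bit inverts the match result for negative masks.
        vector.append(int((mc_match and pe_match) != negative))
    return vector
```

For N = 1024 and Q = 16 (n = 10, q = 4) the result has 64 entries, one per
PCU processor controlled by that MC.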

The system of Figure 14 allows 64 bits of an arbitrary general mask
vector or the output of the PE address Mask Decoder to be sent to the Mask
Vector Register of each MC. If the general masking scheme were implemented
on PASM, a reformatting operation would be needed. To execute a mask
instruction, the MCs do not broadcast the entire mask to each processor, but
transfer to a processor only the one bit of the mask that pertains to that
processor. Assume the mask is being used in a partition of size 2^p = RN/Q,
where n-q ≤ p ≤ n. Due to the fact that the physical addresses of all of
the PCU processors in a partition must agree in their n-p low-order bit
positions, the system compiler must rearrange the programmer's logical
general mask to form a physical general mask. This physical mask will be in
a form which can be used more readily by the MCs than can the logical mask.
The PCU processor whose logical address is Rj + i will be the jth processor
controlled by the ith logical MC, 0 ≤ i < R, 0 ≤ j < N/Q. The logical mask
bit Rj + i is moved to the physical mask position (N/Q)i + j. Then the MC
whose logical address is i will load its N/Q-bit mask register with bits
i*N/Q through (i+1)*N/Q-1 of the physical mask. (The translation of logical
MC addresses to physical addresses was described in section III.A.) This
method of loading will send the ith bit of the logical general mask to the
ith logical PCU processor.
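
The rearrangement of a logical general mask into a physical general mask can
be written directly from the index formulas above. The function names and
the use of Python lists (index = bit position) are assumptions for this
sketch.

```python
def logical_to_physical(logical_mask, r_count, per_mc):
    """Move logical mask bit R*j + i to physical mask position (N/Q)*i + j.

    logical_mask: list of R*(N/Q) bits indexed by logical PE address.
    r_count: R, the number of MCs in the partition.
    per_mc: N/Q, the number of PCU processors per MC.
    """
    physical = [0] * (r_count * per_mc)
    for i in range(r_count):        # logical MC number
        for j in range(per_mc):     # processor index within that MC
            physical[per_mc * i + j] = logical_mask[r_count * j + i]
    return physical


def mc_slice(physical, i, per_mc):
    """Bits i*N/Q through (i+1)*N/Q - 1, loaded by logical MC i."""
    return physical[i * per_mc:(i + 1) * per_mc]
```

With R = 2 and N/Q = 4, a logical mask enabling the even-numbered logical
processors gives logical MC 0 a slice of all ones and logical MC 1 a slice of
all zeros, since logical MC 0 controls logical processors 0, 2, 4, and 6.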

As a compromise between the flexibility of general mask vectors and the
conciseness of PE address masks, the MC is allowed to fetch the 64-bit
vector from the Mask Vector Register, perform various logical operations on
the vector, and then reload it. A logical OR of two (or more) vectors
generated by PE address masks is equivalent to taking the union of the sets
of processors activated by the masks. A logical AND is equivalent to the
intersection. The complement operation can be used to implement negative PE
address masks instead of using the exclusive-or gates shown in Figure It.
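
The set interpretation of these logical operations can be checked with
integers standing in for the 64-bit Mask Vector Register. The function names
and the string form of the PE address mask are assumptions for this sketch.

```python
def mask_vector(mask):
    """Bit a of the result is 1 if address a matches the PE address mask
    (symbols "0", "1", "X", high-order position first)."""
    n = len(mask)
    vec = 0
    for a in range(1 << n):
        bits = format(a, "0{}b".format(n))
        if all(m in ("X", b) for m, b in zip(mask, bits)):
            vec |= 1 << a
    return vec


def complement(vec, width):
    """Bitwise complement restricted to `width` bits (negative-mask effect)."""
    return ~vec & ((1 << width) - 1)


even = mask_vector("XXXXX0")   # even-numbered processors of 64
low8 = mask_vector("000XXX")   # processors 0 through 7
union = even | low8            # OR  = union of the two processor sets
both = even & low8             # AND = intersection: processors 0, 2, 4, 6
```

Complementing the vector generated by an all-zero mask enables every
processor except number 0, matching the negative mask example given earlier.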

Thus, PE address masks are concise, easily converted from logical to
physical masks, and can be combined using boolean functions for additional
flexibility. Examples of how PE address masks can be used in algorithms for
image processing tasks are presented in section V.

Both general mask vectors and PE address masks are set at compile time.
Data conditional masks will be implemented in PASM for use when the decision
to enable and disable PEs is made at execution time.


Data conditional masks are the implicit result of performing a
conditional branch dependent on local data in an SIMD machine environment,
where the result of different PEs' evaluations may differ. As a result of a
conditional where statement of the form

     where <data condition> do ... elsewhere ...

each PE will set its own internal flag to activate itself for either the
"do" or the "elsewhere," but not both. The execution of the "elsewhere"
statements must follow the "do" statements; i.e., the "do" and "elsewhere"
statements cannot be executed simultaneously. For example, as a result of
executing the statement:

     where A > B do C ← A elsewhere C ← B

each PE will load its C register with the maximum of its A and B registers,
i.e., some PEs will execute "C ← A," and then the rest will execute "C ← B."
This type of masking is used in such machines as the Illiac IV [3] and PEPE
[10]. "Where statements" can be nested using a run-time control stack.
This is discussed in [46].
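
The two-phase execution of a where statement can be sketched as follows,
using the maximum example above. Each list index stands for one PE; the
function name and the list-based register model are assumptions.

```python
def where_max(a_regs, b_regs):
    """Model "where A > B do C <- A elsewhere C <- B" across all PEs."""
    # Each PE evaluates the condition on its own data and sets its flag.
    flags = [a > b for a, b in zip(a_regs, b_regs)]
    c_regs = [None] * len(a_regs)
    # "do" phase: only PEs whose flag is set are active.
    for pe, flag in enumerate(flags):
        if flag:
            c_regs[pe] = a_regs[pe]
    # "elsewhere" phase: follows the "do" phase; the remaining PEs are active.
    for pe, flag in enumerate(flags):
        if not flag:
            c_regs[pe] = b_regs[pe]
    return c_regs
```

The two loops run one after the other, reflecting the requirement that the
"do" and "elsewhere" statements are broadcast sequentially rather than
simultaneously.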

IV.  SECONDARY MEMORY SYSTEM

A.  Introduction

The Memory Management System in PASM will have its own intelligence and
will use the parallel secondary storage devices of the Memory Storage
System. Giving the Memory Management System its own intelligence will help
prevent the System Control Unit from being overburdened. The parallel
secondary storage devices will allow fast loading and unloading of the N
double-buffered PCU memory modules and will provide storage for system image
and picture data and MIMD programs.

B.  Memory Storage System

Secondary storage for the PCU memory modules is provided by the Memory
Storage System. The Memory Storage System will consist of N/Q independent
Memory Storage units, where N is the number of PCU memory modules and Q is
the number of MCs. The Memory Storage units will be numbered from 0 to
(N/Q)-1. Each Memory Storage unit is connected to Q PCU memory units. For
0 ≤ i < N/Q, Memory Storage unit i is connected to those memory modules
whose physical addresses are of the form (Q * i) + k, 0 ≤ k < Q. Recall
that, for 0 ≤ k < Q, MC k is connected to those processors whose physical
addresses are of the form (Q * i) + k, 0 ≤ i < N/Q. Thus, Memory Storage
unit i is connected to the ith processor/memory module pair of each MC.
This is shown for N = 32 and Q = 4 in Figure 15.

The two main advantages of this approach for a partition of size N/Q
are that (1) all of the memory modules can be loaded in parallel and (2) the
data is directly available no matter which partition (MC group) is chosen.
This is done by storing in Memory Storage unit i the data for a task which
is to be loaded into the ith memory module of the virtual machine of size
N/Q, 0 ≤ i < N/Q. Memory Storage unit i is connected to the ith memory
module in each MC group (i.e., memory modules Q * i, (Q * i) + 1, (Q * i) +
2,...). Thus, no matter which MC group of N/Q processors is chosen, the
data from the ith Memory Storage unit can be loaded into the ith memory
module of the virtual machine, for all i, 0 ≤ i < N/Q, simultaneously.

For example, in Figure 15, if the partition of size N/Q = 8 chosen
consists of the processors connected to MC 2, then Memory Storage unit 0 will
load memory module 2, 1 will load 6, 2 will load 10, etc. If instead MC 3's
processors are chosen, Memory Storage unit 0 will load memory module 3, 1
will load 7, 2 will load 11, etc.
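
The connection rule (Memory Storage unit i serves modules (Q * i) + k) makes
the example easy to reproduce. The function name below is hypothetical.

```python
def modules_loaded_for_mc(n_total, q_count, mc_number):
    """Physical memory module loaded by each Memory Storage unit when the
    partition consists of the processors of MC `mc_number`.

    Entry i of the result is the module loaded by Memory Storage unit i,
    i.e., physical address (Q * i) + k with k = mc_number.
    """
    per_mc = n_total // q_count
    return [q_count * i + mc_number for i in range(per_mc)]
```

For N = 32 and Q = 4, choosing MC 2 gives modules 2, 6, 10, ..., 30, and
choosing MC 3 gives modules 3, 7, 11, ..., 31, all loadable in one parallel
block transfer because each module hangs off a different storage unit.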


Thus, for virtual machines of size N/Q, this secondary storage scheme
allows all N/Q memory modules to be loaded in one parallel block transfer.
This same approach can be taken if only (N/Q)/2^d distinct Memory Storage
units are available, where 0 ≤ d ≤ n-q. In this case, however, 2^d parallel
block loads will be required instead of just one.

Consider the situation where a virtual machine of size RN/Q is desired,
1 ≤ R ≤ Q, and there are N/Q Memory Storage units. In general, a task
needing RN/Q processors, logically numbered 0 to RN/Q-1, will require R
parallel block loads if the data for the memory module whose high-order n-q
logical address bits equal i is loaded into Memory Storage unit i. This is
true no matter which group of R MCs (which agree in their low-order q-r
address bits) is chosen.

For example, consider Figure 15, where N = 32 and Q = 4. Assume that a
virtual machine of size 16 is desired. The data for the memory modules
whose logical addresses are 0 and 1 is loaded into Memory Storage unit 0,
for memory modules 2 and 3 into unit 1, for memory modules 4 and 5 into unit
2, etc. Assume the partition of size 16 is chosen to consist of the
processors connected to MCs 0 and 2 (i.e., all even physically numbered
processors). Then the Memory Storage units first load memory modules
physically addressed 0, 4, 8, 12, 16, 20, 24, and 28 (simultaneously), and
then load memory modules 2, 6, 10, 14, 18, 22, 26, and 30 (simultaneously).
Given this assignment of MCs, the PCU memory module whose physical address is
2 * i has logical address i, 0 ≤ i < 16. Assume instead that the processors
and memory modules associated with MCs 1 and 3 are chosen. First, memory
modules physically addressed 1, 5, 9, 13, 17, 21, 25, and 29 are loaded
simultaneously, and then modules 3, 7, 11, 15, 19, 23, 27, and 31 are loaded
simultaneously. In this case, the memory module whose physical address is
(2 * i) + 1 has logical address i, 0 ≤ i < 16. No matter which pair of MCs
is chosen, only two parallel block loads are needed.
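
The block-load schedule of this example can be generated programmatically.
In the sketch below the function name is hypothetical, and the assignment of
logical MC numbers by the rank of the chosen physical MC numbers is an
assumption that happens to agree with the example; the actual translation is
the one described in section III.A.

```python
def block_loads(q_count, per_mc, chosen_mcs):
    """Return one list per parallel block load.

    Each entry is (storage_unit, physical_module, logical_module). One block
    load serves one chosen MC; the logical MC number is assumed to be the
    rank of the physical MC number among the chosen MCs.
    """
    chosen = sorted(chosen_mcs)
    r_count = len(chosen)
    loads = []
    for rank, k in enumerate(chosen):
        load = []
        for unit in range(per_mc):
            physical = q_count * unit + k      # module (Q * i) + k on unit i
            logical = r_count * unit + rank    # logical address R*j + i
            load.append((unit, physical, logical))
        loads.append(load)
    return loads
```

For Q = 4, N/Q = 8, and MCs 0 and 2, the first load covers physical modules
0, 4, ..., 28 and the second covers 2, 6, ..., 30, with physical address 2i
mapping to logical address i, as stated above.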

Thus, for a virtual machine of size RN/Q, this secondary storage scheme
allows all RN/Q memory modules to be loaded in R parallel block transfers,
1 ≤ R ≤ Q. If only (N/Q)/2^d distinct Memory Storage units are available,
0 ≤ d ≤ n-q, then R * 2^d parallel block loads will be required instead of
just R.

The actual devices that will be used as Memory Storage units will
depend upon the speed requirements of the rest of PASM, cost constraints,
and the state of the art of storage technology at implementation time.
Possibilities to be investigated include disks, bubble memories, and CCDs.

C.  Handling Large Data Sets

The Memory Management System makes use of the double-buffered
arrangement of the memory modules to enhance system throughput. The
scheduler, using information from the System Control Unit such as the number
of PCU processors needed and maximum allowable run time, will sequence tasks
waiting to execute [49]. Typically, all of the data for a task will be
loaded into the appropriate memory units before execution begins. Then,
while a processor is using one of its memory units, the Memory Management
System can be unloading the other unit and then loading it for the next task.
When the task currently executing completes, the PCU processor can switch to
its other memory unit for doing the next task. Based on image processing and
pattern recognition tasks which have been examined thus far, it appears that
the use of double-buffering, the potentially large memory modules, the
number of processors in each virtual machine, and the special purpose design
of PASM make time sharing of the PCU processors and the use of conventional
paging inappropriate.

There may be some cases where all of the data will not fit into the PCU
memory space allocated. Assume a memory frame is the amount of space used
in the PCU memory units for the storage of data from secondary storage for a
particular task. There are tasks where many memory frames are to be
processed by the same program (e.g., maximum likelihood classification of
satellite data [6?]). The double-buffered memory modules can be used so that
as soon as the data in one memory unit is processed, the processor can
switch to the other unit and continue executing the same program. When the
processor is ready to switch memory units, it signals the Memory Management
System that it has finished using the data in the memory unit to which it is
currently connected. Hardware to provide this signaling capability can be
provided in different ways, such as using interrupt lines from the
processors or by using logic to check the address lines between the processor
and its memory modules for a special address code. The processor checks a
data identification tag to ensure that the new memory frame is available, and
then switches memory units, assuming the data is present. The Memory
Management System can then unload the "processed" memory unit and then load
it with the next memory frame or next task. Such a scheme, however,
requires some mechanism which can make variable length portions of programs
or data sets (i.e., local data) stored in one unit of a memory module
available to the other unit when the associated processor switches to access
the next memory frame. This scheme would be used only when multiple memory
frames are to be processed.

One method to do this maintains a copy of local data in both memory
units associated with a given processor so that switching memory units does
not alter the local variable storage associated with the processor. A
possible hardware arrangement to implement this makes use of two
characteristics of the PASM memory access requirements: (1) secondary memory
will not be able to load a given memory unit at the maximum rate it can
accept data, and (2) PCU processors will not often be able (or desire) to
write to memory on successive memory cycles. Because of these two
characteristics, processor stores to local variable storage locations in an
active memory unit can be trapped by a bus interface register and stored in
the inactive memory unit by stealing a cycle on the secondary memory bus. In
essence, this technique makes use of the conventional store-through concept
as described in [16,?3]. An exception to the second characteristic mentioned
above is multiple precision data. If 16 bit words are assumed, then for
higher precision it may be desirable to use two or four words as a group.
However, a simple buffering scheme can handle this possibility.
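
The store-through idea can be sketched as a small model of one
double-buffered memory module. The class and method names are hypothetical,
the bus interface register is modeled as a short queue (which also stands in
for the simple buffering of multi-word stores), and draining the queue before
a switch is an assumption made so the two copies agree at switch time.

```python
class DoubleBufferedModule:
    """Toy model of a two-unit PCU memory module with mirrored local data."""

    def __init__(self, size, local_addresses):
        self.units = [[0] * size, [0] * size]
        self.active = 0                   # unit the processor is connected to
        self.local = set(local_addresses) # local variable storage locations
        self.pending = []                 # trapped stores awaiting a stolen cycle

    def store(self, address, value):
        """Processor store; stores to local data are trapped for mirroring."""
        self.units[self.active][address] = value
        if address in self.local:
            self.pending.append((address, value))

    def steal_cycle(self):
        """One stolen secondary-memory bus cycle: copy one trapped store
        into the inactive unit."""
        if self.pending:
            address, value = self.pending.pop(0)
            self.units[1 - self.active][address] = value

    def switch(self):
        """Switch to the other unit once all trapped stores are mirrored."""
        while self.pending:
            self.steal_cycle()
        self.active = 1 - self.active

    def load(self, address):
        """Processor read from the currently active unit."""
        return self.units[self.active][address]
```

After a switch, local variables read the same in the new unit, while
ordinary frame data in the new unit is whatever secondary memory loaded
there.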

The method described is applicable to any system which allows its
processing tasks to utilize several separate memories and which requires that
identical copies of variable amounts of certain data be maintained in all
memories so used. Further information about this scheme and a discussion of
other possible methods for providing local data storage are presented in
[43].

D.  Altering Loading Sequences

To further increase the flexibility of PASM, a task may alter the
sequence of data processed by it during execution. As an example, consider a
task which is attempting to identify certain features in a series of images.
The task might examine a visible spectrum copy of an image and, based on
features identified in the image, choose to examine an infrared spectrum
copy of the same image. Rather than burden the System Control Unit to
perform data loading sequence alterations, the task is allowed to communicate
directly with the Memory Management System.

In the case of an SIMD task, the associated MC(s) determines if changes
are required in the data loading sequence for the task. If so, an MC
specifies the nature of the changes and communicates them to the Memory
Management System without involving the System Control Unit. Each MC has the
ability to generate loading sequence changes through a PCU processor. For
tasks which require R MCs (1 ≤ R ≤ Q) logically numbered 0 to R-1, logical
MC 0 will handle loading sequence changes. MC 0 uses logical processor
number 0 of the virtual machine to establish a control information list in
logical PCU memory module 0. (The Q PCU processors which can possibly be
logically numbered 0 in a virtual machine are physically numbered 0, 1,
2, ..., Q-1.) This list specifies in a concise fashion the loading sequence
alterations required. The MC initiates the transfer of this list to the
Memory Management System by using logical processor 0 to write a pointer to
the list into the highest addressable memory location of its memory module.
Through the use of a simple address comparison, the write into this memory
location generates an interrupt to the Memory Management System. The Memory
Management System recognizes the interrupt as a request for a loading
sequence change and determines which MC is making the request. The Memory
Management System uses the list of control information (via the pointer
provided) to determine the loading sequence changes required.

An alternative method of interrupt generation is to use an interrupt
line from each of the Q possible logical PCU processor 0's to the Memory
Management System. The method selected for interrupt generation will depend
upon the interrupt capabilities of the microprocessor used in the PCU.
While the loading sequence control information could be passed directly from
the MCs to the Memory Management System, the length of the connections
required may make implementation more difficult and costly.

One hardware scheme which can transfer the control information list
from a PCU memory module to the Memory Management System is shown in Figure
16. The hardware system shown is based upon having the Memory Management
System coordinate the recognition of processor interrupts and the associated
transfers of control information lists from the memory modules. The
interrupt recognition portion is handled by the Processor Interrupt Control Logic
while the transfer of control information lists is handled by the Memory
Module Access Control Logic. Suppose processor i establishes a control
information list in one of its memory units and writes the pointer to the list
into its highest addressable memory location, generating a signal to the
Interrupt Control on Interrupt Request Line i, where 0 ≤ i < Q. The Interrupt
Control signals the Memory Management System, which then uses the Access
Control to read the control information list from memory module i. Finally,
the Memory Management System signals the Interrupt Control to generate a
signal on the Interrupt Accepted Line to processor i.

The same hardware arrangement described for SIMD tasks is used for MIMD
tasks. With each group of N/Q MIMD processors, there is associated a memory
supervisor which is logical processor 0 within the group. All processors
associated with a given memory supervisor make requests for loading sequence
changes through the memory supervisor, without involving the MCs or System
Control Unit. This reduces System Control Unit contention problems and
helps prevent the MC(s), possibly busy orchestrating the activities of the
virtual MIMD machine, from becoming overburdened.

E. Memory Management System

A set of microprocessors is dedicated to performing the Memory
Management System tasks in a distributed fashion, i.e., one processor handles
Memory Storage System bus control, one handles the scheduling tasks, etc.
This distributed processing approach is chosen in order to provide the
Memory Management System with a large amount of processing power at low
cost. Requests coming from different devices can be handled simultaneously.
In addition, dedicating specific microprocessors to certain tasks simplifies
both the hardware and software required to perform each task.

The basic architecture of the Memory Management System is shown in
Figure 17. A master processor coordinates the concurrent tasks executed by the
slave processors. A shared memory approach is planned due to the need to
share data such as task queues. To reduce contention for the shared memory,
each processor uses a local ROM and RAM for storage of code and local data.
In addition, if necessary the shared memory may be interleaved [23] to
further reduce contention. The degree of interleaving desirable may be
determined by simulation studies or queuing theory analysis [14] of the
Memory Management System. The processors within the Memory Management
System may be implemented using commercially available fixed instruction set
microprocessors. The new generation of 16-bit processors [24, 30, 35, 37] is
particularly appropriate since many provide special hardware for operations
such as locked increment and test, memory protection and management, and
problem/supervisor state switching.

The division of tasks chosen is based on the main functions which the
Memory Management System must perform, including: (1) generating tasks based
on PCU memory module load/unload requests from the System Control Unit; (2)
interrupt handling and generating tasks for data loading sequence changes
requested by the PCU processors physically numbered 0 to Q-1 (see previous
subsection); (3) scheduling of Memory Storage System data transfers; (4)
control of input/output operations involving peripheral devices and the
Memory Storage System; (5) maintenance of the Memory Management System file
directory information; and (6) control of the Memory Storage System bus
system.

Most Memory Management System operations will be initiated by the
System Control Unit, so the master processor will communicate with the System
Control Unit and perform the task spawning operations associated with its
requests. PCU processor interrupt handling is assigned to one slave
processor, which sends requests for loading sequence changes to the scheduler.
Scheduling of Memory Management System data transfers is assigned to another
dedicated slave processor since the scheduling of data transfers will be
complex and time consuming if near optimal operation of this system is to be
realized. One slave is devoted to input/output between the Memory Storage
System and peripheral devices, such as magnetic tape units. It handles any
communications with the peripheral devices and arranges access to the Memory
Storage units. The control and maintenance of the Memory Management System
file system may be assigned several slave processors since file location
operations associated with N/Q secondary storage devices may exceed the
processing capabilities of one processor. Alternatively, a microprocessor may
be assigned to each Memory Storage unit for file directory maintenance (e.g.,
intelligent disks), with a single slave coordinating this activity.
Finally, another slave processor is devoted to performing the operations
associated with setting the Memory Storage System buses' control signals needed to
connect each Memory Storage unit to the appropriate PCU memory module.

The hardware structure of the Memory Management System is such that
additional slave processors can be added to perform tasks that are not
considered to be part of the Memory Management System at this time. In an
actual prototype Memory Management System, interfaces for additional slave
processors would be provided to facilitate system expansion and the
incorporation of new features into the Memory Management System.

V.  IMAGE  PROCESSING  ON  PASM 

A. Introduction

In this section, some examples of how PASM can be used to expedite
image processing tasks are presented. A high level language algorithm to
smooth an image and build a histogram demonstrates some of the features of a
programming language for PASM. Implementations of this algorithm and of a
two-dimensional Fast Fourier Transform algorithm demonstrate the ways in
which PASM's parallelism may be used to obtain significant reductions of
execution time on computationally expensive tasks.

Ideally a high level language for image processing will allow
algorithms for image processing tasks to be expressed easily. As an example, a
high level language algorithm which first smooths an image and then builds a
histogram is given in subsection B. The language constructs used are
described in detail in [44]. The language being designed for PASM is a
procedure based structured language which allows the use of index sets similar
to those in TRANQUIL [1]. The characteristics of the image processing
problem domain are being used to facilitate compiling. Analyses of image
processing algorithms, such as those presented here, are being employed to
identify efficient techniques for performing common tasks and to define
storage allocation strategies. Work is currently being done on the design
of an intelligent compiler which incorporates information about commonly
used parallel processing techniques that appear in image processing tasks.
In the next subsection, a high level language algorithm is presented,
followed by a discussion of the implementation of the algorithm on PASM.


B. Smoothing and Histogram Algorithms

The algorithm "picture," shown in Figure 18, has "pixin" as an input
image and "pixout" as an output image. Both pixin and pixout contain 512 by
512 pixels. Each point of pixin is an eight bit unsigned integer
representing one of 256 possible gray levels. Each point in the smoothed image,
pixout(i,j), has the average gray level of pixin(i,j) and its eight nearest
neighbors, pixin(i,j+1), pixin(i,j-1), pixin(i-1,j), pixin(i+1,j),
pixin(i-1,j-1), pixin(i-1,j+1), pixin(i+1,j+1), and pixin(i+1,j-1).
Boundary points of pixout are not calculated since their corresponding points do
not have eight adjacent neighbors. The "picture" routine also produces a
256 bin (one bin for each gray level) histogram, "hist," of the smoothed
image.

Consider how this could be implemented on PASM, with N = 1024. Assume
that the 1024 PEs are logically arranged as an array of 32 by 32 PEs, and
that the PE addresses range from 0 to 1023:

PE 0     PE 1     . . .   PE 31

PE 32    PE 33    . . .   PE 63

  .        .                .

PE 992   . . .            PE 1023

Each PE stores a 16 by 16 block of the 512 by 512 image. Assume that the 16
by 16 blocks are stored in row major order. Thus, PE 0 stores the pixels in
columns 0 to 15 of rows 0 to 15, PE 1 stores the pixels in columns 16 to 31
of rows 0 to 15, and so on. In general, $N = 2^n$ PEs operate on a picture of
$2^k$ by $2^k$ pixels. The PEs are arranged as a $2^{n/2}$ by $2^{n/2}$ array,
where n/2 is an integer, such that each PE stores in its memory a block of
pixels of size $2^{k-(n/2)}$ by $2^{k-(n/2)}$.
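The block mapping just described can be sketched in Python. The function name `pixel_to_pe` and its parameters are illustrative only and are not part of PASM; the sketch assumes the row-major PE numbering and the $2^{k-(n/2)}$ block size given above.

```python
def pixel_to_pe(row, col, n=10, k=9):
    """Map a pixel of a 2^k by 2^k image to (PE address, local row, local col).

    Assumes N = 2^n PEs numbered row-major in a 2^(n/2) by 2^(n/2) array,
    with each PE holding one square block of side 2^(k - n/2).
    """
    if n % 2:
        raise ValueError("n/2 must be an integer")
    pes_per_side = 1 << (n // 2)          # 32 when N = 1024
    block = 1 << (k - n // 2)             # 16 for a 512 by 512 image
    pe = (row // block) * pes_per_side + (col // block)
    return pe, row % block, col % block
```

With the defaults (N = 1024, 512 by 512 image), `pixel_to_pe(0, 16)` gives PE 1 with local coordinates (0, 0), matching the statement that PE 1 holds columns 16 to 31 of rows 0 to 15.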


For notational purposes, let each PE consider its 16 by 16 matrix as

$$H = \begin{pmatrix} h(0,0) & h(0,1) & \cdots & h(0,15) \\ \vdots & & & \vdots \\ h(15,0) & h(15,1) & \cdots & h(15,15) \end{pmatrix}$$

Also, let the subscripts of h(i,j) extend to -1 and 16, in order to aid in
calculations across boundaries of two adjacent blocks in different PEs. For
example, the pixel to the left of h(0,0) is h(0,-1), and the pixel below
h(15,15) is h(16,15). Therefore, -1 ≤ i,j ≤ 16.

A general algorithm to perform the smoothing on pixel h(i,j) to yield
smoothed pixel hs(i,j) is:

for i ← 0 to 15 do
    for j ← 0 to 15 do
        hs(i,j) ← 1/9 * ( h(i+1,j) + h(i-1,j) + h(i,j+1) + h(i,j-1)
                  + h(i+1,j-1) + h(i+1,j+1) + h(i-1,j+1) + h(i-1,j-1) + h(i,j) )

The approach of this algorithm is to perform 1024 16 by 16 pixel evaluations
in parallel rather than one 512 by 512 pixel evaluation as in the sequential
algorithm.
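As a rough illustration of the per-PE computation, the following Python sketch smooths one block whose border values have already been received from neighboring PEs. The 18 by 18 storage layout, the function name `smooth_block`, and the use of integer division are assumptions made for the example, not details taken from PASM.

```python
def smooth_block(h):
    """Smooth one PE's 16 by 16 block.

    h is an 18 by 18 list of lists holding h(-1..16, -1..16); the entry
    h[i + 1][j + 1] stores h(i, j), so the outer ring is the data that
    was transferred in from neighboring PEs.
    """
    hs = [[0] * 16 for _ in range(16)]
    for i in range(16):
        for j in range(16):
            total = sum(h[i + 1 + di][j + 1 + dj]
                        for di in (-1, 0, 1) for dj in (-1, 0, 1))
            hs[i][j] = total // 9   # integer average; rounding mode is an assumption
    return hs
```

Every PE would run the same loop on its own block, which is what makes the 1024 block evaluations simultaneous in SIMD mode.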

At the boundaries of each 16 by 16 array, data must be transmitted
between PEs in order to calculate the smoothed value, hs. For example,
h(0,16) must be transferred from the PE "to the right of" the local PE,
except for PEs 31, 63, 95, ..., 1023, those at the "right edge" of the logical
array of PEs. (Pixels on the edges of the 512 by 512 array are not smoothed.)
The necessary data transfers must be performed before the algorithm above is
executed. The transfer of data from PE i+1 is illustrated as follows.
h'(i,j) denotes an entry in the H matrix of PE i+1.

                    PE i                              PE i+1

h(0,0)   . . .  h(0,14)   h(0,15)     h(0,16)  ← h'(0,0)

h(1,0)   . . .  h(1,14)   h(1,15)     h(1,16)  ← h'(1,0)

  .                 .        .           .          .

h(15,0)  . . .  h(15,14)  h(15,15)    h(15,16) ← h'(15,0)

The compiler generated code for this transfer can be expressed as
follows. "Set ICN to PE+j" sets the interconnection network so that PE P sends
data to PE P + j modulo N, 0 ≤ P < N. PEs transfer data through Data
Transfer Registers. The data are loaded into the DTRin of each PE, the
TRANSFER command moves the data through the network, and the final data are
retrieved from DTRout [55]. The command MASK [address set] is a PE address
mask that determines which PEs will execute the instructions that follow.
The absence of a mask implies all PEs are active. The mask [-X^5 1^5]
deactivates the PEs on the right side of the image, i.e., PEs 31, 63, ..., 1023.

SET ICN to PE-1;

DTRin ← h(0 → 15, 0);

TRANSFER;

MASK [-X^5 1^5] h(0 → 15, 16) ← DTRout;

The transfers of data for the remaining three sides of the array H, i.e.,
from PEs i-1, i+32, and i-32, are accomplished in a similar manner.
Smoothing of the four points h(0,0), h(0,15), h(15,0), and h(15,15) requires data
that reside in PEs i-33, i-31, i+31, and i+33, respectively. This
necessitates four additional parallel pixel transfers. Both the multistage Cube
and PM2I networks can perform each of these connections in a single pass.

In order to perform a smoothing operation on a 512 by 512 image by the
parallel smoothing of 256 point blocks of size 16 by 16, the total number of
parallel word transfers is 4(16) + 4 = 68. The corresponding sequential
algorithm needs no data transfers between PEs, but calculates hs for 512 × 512
= 262,144 points. If no data transfers were needed, the parallel algorithm
would be faster than the sequential algorithm by a factor of 262,144/256 =
1024. If it is assumed that each parallel data transfer requires at most as
much time as one smoothing operation, then the time factor is 262,144/324 =
809. That is, the parallel algorithm is about three orders of magnitude
faster than the sequential algorithm. The approximation is a conservative
one, since calculating the addresses of the nine pixels involves nine
multiplications using the subscripts [15]. Block transfers of the data on the
four sides of the array could be done rapidly with a pipelined multistage
interconnection network [53]. (Another factor which must be considered in
evaluating the speedup is processor speed. An IBM 370 will process data
faster than a typical microprocessor. Even with possible differences in
processing speed, PASM will still perform this task at least two orders of
magnitude faster.)
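The speedup estimate can be checked with a few lines of Python. The variable names are illustrative, and the calculation uses the same simplifying assumption as the text, namely that one parallel transfer costs no more than one smoothing operation.

```python
pixels = 512 * 512                 # points smoothed by the sequential algorithm
ops_per_pe = 16 * 16               # smoothing operations performed in each PE
transfers = 4 * 16 + 4             # four sides of 16 pixels plus four corners

ideal_speedup = pixels / ops_per_pe
speedup_with_transfers = pixels / (ops_per_pe + transfers)

print(transfers, ideal_speedup, round(speedup_with_transfers))
```

The transfer overhead lowers the estimate from 1024 to roughly 809, which is still close to three orders of magnitude.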

Now consider implementing the histogram calculation. Since the image
hs is spread out over 1024 PEs, each PE will calculate a 256 bin histogram
based on its 16 by 16 segment of the image. Then these "local" histograms
will be combined using the algorithm described below. This algorithm is
demonstrated for N = 16 and B = 4 bins, instead of N = 1024 and B = 256
bins, in Figure 19. Both the multistage Cube and PM2I networks can perform
each of the needed transfers in a single pass.

In the first $b = \log_2 B$ steps, each block of B PEs performs B
simultaneous recursive doublings to compute the histogram for the portion of the
image contained in the block. At the end of the b steps, each PE has one bin
of this partial histogram. The way in which this is accomplished is by
first dividing the B PEs of a block into two groups. Each group accumulates
the sums for half of the bins, and sends the bins it is not accumulating to
the group which is accumulating those bins. At each step of the algorithm,
each group of PEs is divided in half such that the PEs with the lower
addresses form one group, and the PEs with the higher addresses form another.
For example, in Figure 19 in the first step, the group of PEs 4, 5, 6, and 7
is divided into the two groups of PEs 4 and 5, and 6 and 7. The accumulated
sums are similarly divided in half based on their indices in the histogram.
Next, the group with the lower PE addresses sends the group with the higher
addresses the half of the sums which has the higher indices, while the group
with the higher addresses sends the group with the lower addresses the sums
with the lower indices. For example, in Figure 19, in the first step, PEs 4
and 5 send their bins 2 and 3 data to PEs 6 and 7. At the same time PEs 6
and 7 send their bins 0 and 1 data to PEs 4 and 5. After b steps, each PE
has the total value for one bin from the portion of the image contained in
the B PEs in its block.

The next $\log_2(N/B)$ steps combine the results for these blocks to yield
the histogram of the entire image spread out over B PEs, with the sum for
bin i in processor i, 0 ≤ i < B. This is done by performing $\log_2(N/B)$ steps
of a recursive doubling algorithm to sum the partial histograms from the N/B
blocks. This is shown by the last two steps of Figure 19. A general
algorithm to compute the B bin histogram for an image spread out over N PEs is
shown in Figure 20. For the "picture" algorithm example discussed above, N
= 1024 and B = 256.
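The two phases described above can be simulated sequentially in Python. This is only a sketch: each list entry stands in for one PE's memory, the function name `merge_histograms` is hypothetical, and in the second phase the pairing of each receiving PE with the PE one block-stride above it is one valid choice rather than necessarily the connection used in PASM.

```python
def merge_histograms(local, B):
    """Simulate the two-phase merge on N = len(local) PEs.

    local[p] is PE p's B-bin histogram. Returns a list whose entry i is
    the image-wide total for bin i, as held by PE i (0 <= i < B).
    N and B are assumed to be powers of two with N >= B.
    """
    N = len(local)
    b = B.bit_length() - 1                  # b = log2(B)
    hist = [list(h) for h in local]
    keep = [0] * N                          # first bin each PE still accumulates
    # Phase 1: b halving steps inside every block of B PEs.
    for i in range(b):
        half = 1 << (b - 1 - i)             # bins sent per PE and partner distance
        send = [0] * N
        for p in range(N):
            if p & half:                    # higher-address group keeps upper half
                send[p] = keep[p]
                keep[p] += half
            else:                           # lower-address group keeps lower half
                send[p] = keep[p] + half
        sent = [hist[p][send[p]:send[p] + half] for p in range(N)]
        for p in range(N):
            for t, value in enumerate(sent[p ^ half]):
                hist[p][keep[p] + t] += value
    # Phase 2: log2(N/B) steps that sum the per-block results.
    step = B
    while step < N:
        sent = [hist[p][keep[p]] for p in range(N)]
        for p in range(N):
            if not p & step:                # receiving PEs add the partner's bin
                hist[p][keep[p]] += sent[p + step]
        step *= 2
    return [hist[i][keep[i]] for i in range(B)]
```

For N = 16 and B = 4 the first loop runs twice and the second loop runs twice, the same four steps that Figure 19 walks through.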

A sequential algorithm for calculating hist for a 512 by 512 picture
requires 512 × 512 = 262,144 additions. The parallel algorithm described
above uses 16 × 16 = 256 additions for each PE to calculate its "local"
histogram, and 255 + 2 = 257 steps (transfer and add) to merge the histogram
into the first 256 PEs. At step i in the computation of the partial
histograms, $0 \le i < \log_2 B$, the number of data transfers required is
$B/2^{i+1}$. A total of B-1 transfers are performed in the first $\log_2 B$
steps of the algorithm. $\log_2(N/B)$ parallel transfers are needed to combine
the partial histograms. In general, therefore, this technique requires
$B - 1 + \log_2(N/B)$ parallel transfer/add operations, where N ≥ B and B is a
power of two, plus the additions needed to compute the local histograms. For
an M by M image spread out over N PEs, N a power of two, the number of
parallel additions to compute the local histograms equals the number of
pixels in each PE, which equals $M^2/N$. The number of addition steps is
therefore reduced by a factor of N, at the added cost of $B - 1 + \log_2(N/B)$
transfer/add operations. The result of the algorithm, i.e., the histogram,
is spread out over the first B processors. This distribution may be
efficient for further processing on the histogram, e.g., finding the maximum
or minimum, or for smoothing the histogram. If it is necessary for the
entire histogram to be in a single processor, B-1 additional parallel data
transfers are required.
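The operation counts in this paragraph follow directly from the two formulas, as the short Python sketch below shows. The function name `histogram_steps` is illustrative; it assumes N and B are powers of two with N ≥ B and that the image divides evenly among the PEs.

```python
def histogram_steps(M, N, B):
    """Return (local additions per PE, parallel transfer/add steps).

    Assumes an M by M image spread evenly over N PEs and a B-bin
    histogram, with N and B powers of two and N >= B.
    """
    local_additions = (M * M) // N
    log_blocks = (N // B).bit_length() - 1      # log2(N/B) for a power of two
    merge_steps = (B - 1) + log_blocks
    return local_additions, merge_steps

print(histogram_steps(512, 1024, 256))
```

For the 512 by 512 example this gives 256 local additions and 257 merge steps per PE, compared with 262,144 additions in the sequential case.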

C. Two-Dimensional FFT

As an example of the applicability of parallel computations to a
different type of image processing task, the Fast Fourier Transform (FFT) is
considered. A parallel algorithm for the one-dimensional FFT, often used in
speech processing [52], is presented in [49]. Here the two-dimensional FFT
is examined.

The two-dimensional discrete Fourier transform (DFT) of an L by M array
of elements S(l,m) is defined as

$$F(v,w) = \sum_{l=0}^{L-1} \sum_{m=0}^{M-1} S(l,m)\, W_L^{lv}\, W_M^{mw}, \qquad 0 \le v < L,\ 0 \le w < M$$

The two-dimensional transform can be decomposed in such a way as to reduce
its computation to the execution of a number of one-dimensional DFTs.
Performing the DFT on each row of the array yields

$$G(l,w) = \sum_{m=0}^{M-1} S(l,m)\, W_M^{mw}, \qquad 0 \le l < L,\ 0 \le w < M$$

The DFT of the array can then be obtained by taking the DFT of each column
of G:

$$F(v,w) = \sum_{l=0}^{L-1} G(l,w)\, W_L^{lv}, \qquad 0 \le v < L,\ 0 \le w < M$$

Thus the two-dimensional transform can be obtained by computing L
one-dimensional M-point transforms on the L rows of the S array, then computing
M one-dimensional L-point transforms on the M columns of the G array
resulting from the row transforms [27, 11].
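The row-then-column decomposition can be checked numerically on a small array. The Python sketch below uses direct $O(n^2)$ DFTs rather than an FFT to keep it short, and it assumes the common convention $W_n = e^{-2\pi i/n}$; the function names are illustrative.

```python
import cmath

def dft(x):
    """Direct one-dimensional DFT, assuming W_n = exp(-2*pi*i/n)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * w / n) for m in range(n))
            for w in range(n)]

def dft2_row_column(S):
    """Two-dimensional DFT of an L by M array: row DFTs, then column DFTs."""
    L, M = len(S), len(S[0])
    G = [dft(row) for row in S]                               # G(l, w)
    cols = [dft([G[l][w] for l in range(L)]) for w in range(M)]
    return [[cols[w][v] for w in range(M)] for v in range(L)]  # F(v, w)

def dft2_direct(S):
    """Two-dimensional DFT evaluated straight from the double sum."""
    L, M = len(S), len(S[0])
    return [[sum(S[l][m] * cmath.exp(-2j * cmath.pi * (l * v / L + m * w / M))
                 for l in range(L) for m in range(M))
             for w in range(M)] for v in range(L)]
```

Comparing `dft2_row_column(S)` with `dft2_direct(S)` for a small S shows the two agree to within floating point error, which is the property the parallel algorithm relies on.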

Consider performing the DFT on an M by M array S. For example, assume
M = 1024, and the array represents a 1024 by 1024 point picture. Figure 21
outlines an efficient parallel two-dimensional FFT algorithm which uses M
PEs. It is assumed that initially PE i contains row i of the array S. The
DFTs on the rows of S are performed by executing a serial FFT in each PE, on
the row of S contained in that PE. This serial FFT can be executed
simultaneously by each of the M PEs. The resulting G array has row i in PE i.
The transpose operation is performed on G, to rearrange the array so that PE
i contains column i of G. A serial FFT executed in parallel in each PE
performs the DFT on the columns of G, producing the transform array F.

Figure 22 presents an algorithm to transpose the array G. The basic
operation performed is the transfer of array element G(j,k) from PE j to PE
k. This is done for M G(j,k)'s in parallel, using an interconnection
function which sends data from PE j to PE (j+i) mod M for all of the G(j,k) for
which (k-j) mod M is equal to i. The parallel transfer operation is
performed for 1 ≤ i < M. (For i=0, no transfer is needed; i.e., the diagonal
of the transpose matrix is the same as the diagonal of the original matrix.)
For each i value, the element which PE j sends is the k-th element of the row
of G held in PE j, where k=(j+i) mod M. That element, received in PE k, is
stored as the j-th element of the column of G transpose being created in PE
k, where j=(k-i) mod M. In the algorithm, it is assumed that each PE has an
address register, ADDRESS, which contains the integer i in PE i, 0 ≤ i < M.
In each PE, G denotes the address of the first element of the G vector (row)
that is stored in that PE; GT denotes the address of the first element of
the G transpose vector in each PE. If b bytes of storage are needed for
each array element, G(j,k) is initially in location G+k*b of PE j, and
after execution of the transpose algorithm, G(j,k) is in location GT+j*b of
PE k, 0 ≤ j,k < M. Each of the interprocessor data transfers needed for the
transpose can be done in a single pass through either a multistage Cube or
PM2I network.
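The transpose can be simulated sequentially in Python to confirm the index arithmetic. In this sketch each list `G[j]` stands in for the row held in PE j, and the lists `dtr_in` and `dtr_out` model the Data Transfer Registers; the function name `parallel_transpose` is illustrative.

```python
def parallel_transpose(G):
    """Simulate the (M-1)-step transpose; G[j] is the row held by PE j."""
    M = len(G)
    GT = [[None] * M for _ in range(M)]
    for i in range(1, M):
        # Every PE j loads element k = (j + i) mod M of its row.
        dtr_in = [G[j][(j + i) % M] for j in range(M)]
        # The network delivers PE j's datum to PE (j + i) mod M.
        dtr_out = [None] * M
        for j in range(M):
            dtr_out[(j + i) % M] = dtr_in[j]
        # PE k stores it as element j = (k - i) mod M of its column.
        for k in range(M):
            GT[k][(k - i) % M] = dtr_out[k]
    for j in range(M):                  # the diagonal needs no transfer
        GT[j][j] = G[j][j]
    return GT
```

Each pass of the outer loop corresponds to one parallel transfer, so an M by M array needs M-1 transfers in total.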

The parallel two-dimensional FFT algorithm presented requires $M \log_2 M$
parallel complex multiplications and M-1 parallel data transfers to compute
the two-dimensional DFT of an M by M array. The computation of the
intermediate array G by broadcasting a serial FFT algorithm to all PEs, where
each PE has a different row of S, entails $(M/2)\log_2 M$ parallel complex
multiplications. Serial computation of the DFTs on the M rows of S would require
$(M^2/2)\log_2 M$ complex multiplications, so speedup by a factor of M is
achieved. This is the maximum speedup obtainable with M PEs. (Similarly,
maximal reduction in the number of complex additions is achieved. In
general, however, the multiplication time will dominate the addition time.)
Computation of F from the transpose of G likewise entails $(M/2)\log_2 M$
parallel complex multiplications. M-1 parallel data transfers are required to
transpose the G array. For M = 1024, a total of 10,240 parallel complex
multiplications and 1023 parallel transfers will be executed. Serial
implementation of the FFT computation would require $M^2 \log_2 M$ complex
multiplications, which for M = 1024 would be 10,485,760. In a system such as PASM,
the FFT algorithm would be a system function, callable as a library routine
from a user's program.
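The counts quoted for M = 1024 can be reproduced with a short Python sketch. The function name `fft2_counts` is illustrative, and M is assumed to be a power of two.

```python
def fft2_counts(M):
    """Operation counts for the two-dimensional FFT of an M by M array.

    Returns (parallel complex multiplications on M PEs,
             serial complex multiplications,
             parallel data transfers for the transpose).
    """
    log_m = M.bit_length() - 1               # log2(M) for a power of two
    parallel = 2 * (M // 2) * log_m          # row pass plus column pass
    serial = M * M * log_m
    transfers = M - 1
    return parallel, serial, transfers

print(fft2_counts(1024))
```

The ratio of serial to parallel multiplications is M, the maximum speedup obtainable with M PEs, before the M-1 transfers are taken into account.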

VI.  CONCLUSIONS 

PASM, a large scale partitionable SIMD/MIMD multimicroprocessor system
for image processing and pattern recognition, has been presented. Its
overall architecture was described and examples of how PASM can realize
significant computational improvements over conventional systems were
demonstrated. Current work includes specifying the hardware design details and
developing the operating system and programming languages for a prototype
system. A dynamically reconfigurable system such as PASM should be a
valuable tool for both image processing/pattern recognition and parallel
processing research.

Conceptual view of a recirculating (single stage) network. "IS" is
input selector, "OS" is output selector.


Data manipulator network, for N=8. U means use dashed line
connection, D means use solid line connection, and H means use dotted
line connection. Lower case letters indicate end-around connections.


Fig. 10. Micro Controller Communications Bus (MCCB)

Fig. 16. Hardware scheme for dynamically altering the loading sequence
of the memory modules.


Fig. 17. Distributed Memory Management System


PROCEDURE picture

/* define pixin and pixout to be 512x512
   arrays of unsigned eight bit integers */

UNSIGNED BYTE pixin[512][512], pixout[512][512];

/* define hist to be a 256 word array of integers */

INTEGER hist[256];

/* define x and y to be index sets */

INDEX x, y;

/* declare pixin to be loaded by input data
   and pixout and hist to be unloaded as
   output data */

DATA INPUT pixin;

DATA OUTPUT pixout, hist;

/* define the sets of indices which x and y
   represent, i.e., x and y represent the integers
   between 1 and 510 inclusive */

x = y = (1 → 510);

/* compute average of each point and its
   eight nearest neighbors (simultaneously if possible) */

pixout[x][y] = (pixin[x-1][y-1]+pixin[x-1][y]+pixin[x-1][y+1]+
                pixin[x][y-1]+pixin[x][y]+pixin[x][y+1]+
                pixin[x+1][y-1]+pixin[x+1][y]+pixin[x+1][y+1])/9;

/* initialize each bin to zero */

hist[0 → 255] = 0;

/* compute histogram */

hist[pixout[x][y]] = hist[pixout[x][y]] + 1;

END picture


Fig. 18. High level language algorithm for smoothing and computing
histogram. Keywords are in upper case for ease of identification,
however in actual use they need not be.

Fig. 19. Histogram calculation for N = 16 PEs, B = 4 bins.
(w, ..., z) denotes that bins w through z of the partial histogram
are in the PE.


/* algorithm to combine "local" histograms. b' = log2(B) - 1; n = log2(N);
   keep is the index of the first bin which is to be kept,
   send is the index of the first bin which is to be
   sent to another PE */

keep <- 0;

/* form histogram for each group of B PEs */
for i <- 0 to b' do

    /* group of PEs with higher addresses prepares to send first half
       of remaining bins to group of PEs with lower addresses */

    MASK [X^(n-(b'-i)-1) 1 X^(b'-i)]
        send <- keep;
        keep <- send + 2^(b'-i);
        SET ICN to PE - 2^(b'-i);

    /* group of PEs with lower addresses prepares to send second half of
       remaining bins to group of PEs with higher addresses */

    MASK [X^(n-(b'-i)-1) 0 X^(b'-i)]
        send <- keep + 2^(b'-i);
        SET ICN to PE + 2^(b'-i);

    /* transfer 2^(b'-i) bins, add received data to kept data */

    MASK [X^n]
        DTRin <- hist[send -> send + 2^(b'-i) - 1];
        TRANSFER;
        hist[keep -> keep + 2^(b'-i) - 1] <- hist[keep -> keep + 2^(b'-i) - 1]
                                             + DTRout;

/* Combine N/B partial histograms to form histogram of entire image. */
for i <- 0 to log2(N/B) - 1

    MASK [X^(n-b'-i-2) 1 X^(b'+i+1)]
        SET ICN to PE - 2^(b'+i+1);
        DTRin <- hist[keep];
        TRANSFER;

    MASK [X^(n-b'-i-2) 0 X^(b'+i+1)]
        hist[keep] <- hist[keep] + DTRout


Fig. 20. Algorithm to perform the B bin histogram calculation for an image
spread out over N PEs.
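
The two phases of Fig. 20 can be checked with a small sequential simulation in Python. This is a sketch under stated assumptions, not the PASM implementation: the function name `combine_histograms` is hypothetical, the PEs are modeled as list indices, a TRANSFER is modeled as every sender writing into its destination's receive slot, and the figure's exponent is read as 2^(b-1-i) with b = log2(B). Both N and B are assumed to be powers of two with B <= N.

```python
def combine_histograms(local):
    """Simulate the Fig. 20 merge. local[p] is the B-bin histogram in PE p.

    Returns the full-image histogram, where entry j is the value that
    ends up in PE j (0 <= j < B).
    """
    N, B = len(local), len(local[0])
    n, b = N.bit_length() - 1, B.bit_length() - 1  # log2 N, log2 B
    hist = [list(h) for h in local]
    keep = [0] * N  # per-PE index of the first bin that PE keeps

    # Phase 1: within each block of B PEs, halve the kept bins b times.
    for i in range(b):
        width = 1 << (b - 1 - i)  # number of bins sent in this step
        send, dest = [0] * N, [0] * N
        for pe in range(N):
            if pe & width:  # higher-address group sends the first half
                send[pe] = keep[pe]
                keep[pe] = send[pe] + width
                dest[pe] = pe - width
            else:           # lower-address group sends the second half
                send[pe] = keep[pe] + width
                dest[pe] = pe + width
        dtr_out = [None] * N
        for pe in range(N):  # all PEs transfer simultaneously
            dtr_out[dest[pe]] = hist[pe][send[pe]:send[pe] + width]
        for pe in range(N):  # add received bins to kept bins
            for k in range(width):
                hist[pe][keep[pe] + k] += dtr_out[pe][k]

    # Phase 2: fold the N/B blocks together, one address bit per step.
    for i in range(n - b):
        step = 1 << (b + i)
        dtr_out = {}
        for pe in range(N):
            if pe & step:  # sender: address bit b+i is 1
                dtr_out[pe - step] = hist[pe][keep[pe]]
        for pe in range(N):
            if not pe & step:  # receiver: address bit b+i is 0
                hist[pe][keep[pe]] += dtr_out[pe]

    return [hist[pe][keep[pe]] for pe in range(B)]
```

For N = 16 and B = 4 this performs two halving steps followed by two folding steps, after which PE j holds the count for bin j.
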


Fig. 21. Computation of two-dimensional DFT of M by M array S using M PEs.


/* transpose M by M array G, placing result in GT. PE k receives G(j,k)
   from PE j as follows: At step i, G(j,k)'s for which (k-j) mod M = i are
   transferred from PE j to PE k */

/* In each PE, G denotes the address of the first element of the G vector
   that is stored in that PE; GT denotes the address of the first element of
   the G transpose vector in each PE. */

for i <- 1 to M-1 do

    /* in PE j, send G(j,k), k = j + i */

    DTRin <- G + (ADDRESS + i) mod M;

    SET ICN to PE + i;

    TRANSFER;

    /* in PE k, receive G(j,k), j = (k-i) mod M */

    GT + (ADDRESS - i) mod M <- DTRout;

/* copy diagonal from G to GT */

GT + ADDRESS <- G + ADDRESS

Fig. 22. Algorithm to transpose an M by M array using M PEs, where initially
PE j holds row j of the array.
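
The transpose of Fig. 22 can be illustrated with a short sequential simulation in Python. This is a sketch only: the function name `transpose_by_shifts` is hypothetical, each PE is modeled as one row of a list, and the network setting "PE + i" is assumed to wrap modulo M, which the figure's comments suggest but do not state outright.

```python
def transpose_by_shifts(rows):
    """Simulate Fig. 22: PE j starts with row j and ends with column j.

    At step i (1 <= i <= M-1), PE j sends element G(j, (j+i) mod M)
    to PE (j+i) mod M, which stores it at position (k-i) mod M of its
    transpose vector. The diagonal is copied locally at the end.
    """
    M = len(rows)
    g = [list(r) for r in rows]
    gt = [[None] * M for _ in range(M)]

    for i in range(1, M):
        dtr_out = [None] * M
        for j in range(M):          # all PEs send simultaneously
            k = (j + i) % M         # assumption: addresses wrap mod M
            dtr_out[k] = g[j][k]
        for k in range(M):          # PE k stores what it received
            j = (k - i) % M
            gt[k][j] = dtr_out[k]

    for j in range(M):              # diagonal needs no transfer
        gt[j][j] = g[j][j]
    return gt
```

Each of the M-1 steps is a uniform shift by i, so every PE sends and receives exactly one element per step.
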
ABSTRACT

PASM, a large-scale multimicroprocessor system being designed at Purdue
University for image processing and pattern recognition, is described. This
system can be dynamically reconfigured to operate as one or more independent
SIMD and/or MIMD machines. PASM consists of a Parallel Computation Unit,
which contains N processors, N memories, and an interconnection network;
Micro Controllers, each of which controls N/Q processors; N/Q parallel
secondary storage devices; a distributed Memory Management System; and a
System Control Unit, to coordinate the other system components. Possible
values for N and Q are 1024 and 16, respectively. The control schemes,
interprocessor communications, and memory management in PASM are explored.
Examples of how PASM can be used to perform image processing tasks are
given.